MultiHashFormer: Hash-based Generative Language Models

Atsuki Yamaguchi; Huiyin Xue; Nikolaos Aletras

arxiv: 2606.28057 · v1 · pith:4AT4N2UUnew · submitted 2026-06-26 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-29 04:09 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords hash-based embeddings · autoregressive language models · parameter efficiency · multilingual models · vocabulary expansion · transformer decoder

The pith

MultiHashFormer represents each token as a unique sequence of hash IDs from multiple functions to enable autoregressive hash-based language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MultiHashFormer as a way to apply hashing to causal language models despite collisions that break standard approaches. Each token receives a short signature of discrete IDs from several independent hash functions. A Hash Encoder compresses the signature to a latent vector for Transformer processing, and a Hash Decoder predicts the next signature before mapping it back to text. Evaluations at 100M, 1B and 3B scales show consistent gains over standard Transformer LMs, and the model adds new languages to the vocabulary without any increase in parameters.

Core claim

Each token is represented as a unique hash signature consisting of a short sequence of discrete hash IDs generated by multiple independent hash functions. The Hash Encoder compresses this signature into a single latent vector fed to the Transformer decoder. The Hash Decoder then produces the hash signature of the next token, which is resolved back to the corresponding text token.
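
To make the signature idea concrete, here is a minimal sketch of how a token could be mapped to a short tuple of bucket IDs. It assumes salted BLAKE2b from Python's standard library as a stand-in for the independent hash functions; the names `hash_signature` and `build_signature_table`, the bucket count, and the toy vocabulary are all hypothetical and not taken from the paper.

```python
import hashlib


def hash_signature(token: str, num_hashes: int = 3, num_buckets: int = 1024) -> tuple:
    """Return one bucket ID per seeded hash function (illustrative only)."""
    ids = []
    for seed in range(num_hashes):
        # Assumption: a different salt per seed approximates independent hash functions.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        ids.append(int.from_bytes(digest, "little") % num_buckets)
    return tuple(ids)


def build_signature_table(vocab, num_hashes: int = 3, num_buckets: int = 1024) -> dict:
    """Map each signature back to its token, failing loudly on a shared signature."""
    table = {}
    for token in vocab:
        sig = hash_signature(token, num_hashes, num_buckets)
        if sig in table:
            raise ValueError(f"signature collision: {token!r} vs {table[sig]!r}")
        table[sig] = token
    return table


# Hypothetical toy vocabulary.
vocab = ["the", "hash", "former", "▁token", "日本語"]
table = build_signature_table(vocab)
for sig, token in table.items():
    print(token, sig)
```

The explicit collision check reflects the requirement that signatures be unique; how the paper itself enforces uniqueness is not described in the material reviewed here.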

What carries the argument

Multi-hash signature: a sequence of hash IDs from independent functions that uniquely identifies a token, allowing compression to one embedding vector and reconstruction by the decoder.
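
One way to picture the compression step is the sketch below. Figure 1 describes the encoder as a gated compositional embedding but the material here gives no formula, so the softmax gate, the class name `ToyHashEncoder`, and every size are assumptions chosen for illustration.

```python
import math
import random


class ToyHashEncoder:
    """One embedding table per hash function, mixed by a softmax gate (assumed form)."""

    def __init__(self, num_hashes: int, num_buckets: int, dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        # tables[k][b] is the embedding row for bucket b of hash function k.
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_buckets)]
            for _ in range(num_hashes)
        ]
        # One gate vector per hash function (hypothetical parameterization).
        self.gate_w = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_hashes)]

    def num_parameters(self) -> int:
        table_params = sum(len(t) * self.dim for t in self.tables)
        return table_params + len(self.gate_w) * self.dim

    def encode(self, signature) -> list:
        """Compress a multi-ID signature into a single vector of length dim."""
        rows = [self.tables[k][bucket] for k, bucket in enumerate(signature)]
        scores = [sum(w * x for w, x in zip(self.gate_w[k], row)) for k, row in enumerate(rows)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        gates = [e / total for e in exps]
        # Gate-weighted sum of the looked-up rows.
        return [sum(g * row[i] for g, row in zip(gates, rows)) for i in range(self.dim)]


encoder = ToyHashEncoder(num_hashes=3, num_buckets=16, dim=8)
vector = encoder.encode((1, 2, 3))
print(len(vector), encoder.num_parameters())
```

Because each hash function has its own table, two signatures that differ in any position look up at least one different row, which is what lets a single vector still distinguish tokens.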

If this is right

MultiHashFormer models at 100M, 1B and 3B parameters outperform standard Transformer LMs across multiple benchmarks.
Vocabulary expansion to include additional languages occurs with no change to the overall parameter count.
Embedding parameters no longer scale linearly with vocabulary size in generative settings.
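
The scaling claim can be checked with back-of-envelope arithmetic. The sketch below compares a conventional input embedding matrix with hash-bucket tables; the hidden size and bucket count are hypothetical, three hash functions mirrors the Figure 1 illustration, and "32K" and "48K" are read as round thousands. It ignores any gate or decoder-head parameters.

```python
def standard_embedding_params(vocab_size: int, dim: int) -> int:
    """Conventional input embedding matrix: one row per token."""
    return vocab_size * dim


def hashed_embedding_params(num_hashes: int, num_buckets: int, dim: int) -> int:
    """One table of num_buckets rows per hash function, independent of vocabulary size."""
    return num_hashes * num_buckets * dim


def signature_capacity(num_hashes: int, num_buckets: int) -> int:
    """Number of distinct signatures available."""
    return num_buckets ** num_hashes


DIM = 2048          # hypothetical hidden size
NUM_HASHES = 3      # matches the Figure 1 illustration
NUM_BUCKETS = 4096  # hypothetical bucket count B

for vocab_size in (32_000, 48_000):  # assumption: "32K" and "48K" as round thousands
    print(
        vocab_size,
        standard_embedding_params(vocab_size, DIM),
        hashed_embedding_params(NUM_HASHES, NUM_BUCKETS, DIM),
    )
print("capacity:", signature_capacity(NUM_HASHES, NUM_BUCKETS))
```

Under these assumed sizes the standard matrix grows from 65,536,000 to 98,304,000 parameters while the hashed tables stay at 25,165,824, since new tokens only claim unused signatures.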

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The constant-parameter multilingual property could support models that dynamically incorporate new languages or domain-specific terms after initial training.
Error accumulation in the hash-decoding step might become measurable only at sequence lengths far beyond typical benchmarks.
The same multi-hash approach might combine with existing compression methods such as low-rank adapters to further reduce memory.

Load-bearing premise

That mapping a predicted hash signature back to a unique token remains reliable during autoregressive generation without ambiguity or systematic errors.

What would settle it

Generation runs in which the decoder produces hash signatures that map to the wrong token or collide with multiple tokens at a rate high enough to degrade output quality compared with standard models.
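
The failure modes named above can be separated with a simple lookup. This is a minimal sketch assuming the signature-to-token mapping is a dictionary; the function names and the hand-written toy table are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

Signature = Tuple[int, ...]


def resolve_signature(predicted: Sequence[int], table: Dict[Signature, str]) -> Optional[str]:
    """Map a predicted signature to its token, or None if no token owns it."""
    return table.get(tuple(predicted))


def classify_step(predicted: Sequence[int], gold_token: str, table: Dict[Signature, str]) -> str:
    """Label one generation step as correct, wrong_token, or invalid."""
    token = resolve_signature(predicted, table)
    if token is None:
        return "invalid"  # the signature belongs to no token
    return "correct" if token == gold_token else "wrong_token"


# Hand-written toy table; real signatures would come from the hash functions.
toy_table = {(1, 2, 3): "cat", (1, 2, 4): "dog"}

print(classify_step((1, 2, 3), "cat", toy_table))  # correct
print(classify_step((1, 2, 4), "cat", toy_table))  # wrong_token
print(classify_step((1, 9, 3), "cat", toy_table))  # invalid
```

Note that a single wrong ID can land on either failure: it yields another token's signature in the second call and an unowned signature in the third.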

Figures

Figures reproduced from arXiv: 2606.28057 by Atsuki Yamaguchi, Huiyin Xue, Nikolaos Aletras.

**Figure 1.** Overview of the MULTIHASHFORMER framework using three different hash functions (for illustration purposes) with B buckets to obtain a multi-ID hash signature. At the input, a gated compositional embedding compresses the multi-ID signature into a dense latent vector, providing the backbone with a unified representation for sequence processing (Hash Encoder). At the output, a cascaded predictor re…

**Figure 2.** MULTIHASHFORMER models (blue shades, diamond marker) perform similarly to or better than Standard (orange shades, circle marker) at the 1B and 3B scales after expanding the vocabulary from 32K to 48K without adding a single new parameter. Black horizontal lines denote random baseline performance per task. [Fragment of an adjacent table, truncated in the source: Model | Last Hidden States (Pearson r ↑, Spearman ρ ↑) | Second-to-Last Hidden States (Pearson r ↑, Spearman ρ ↑); Standard: 0.20, 0.2…]

**Figure 3.** Average scores on four core capabilities across …

Original abstract

Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
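
The collision argument in the abstract can be made quantitative with a standard birthday-bound estimate. The block below is a sketch under an idealized assumption (each hash function is uniform and independent); the symbols V (vocabulary size), K (number of hash functions) and B (buckets per function) follow the Figure 1 notation where available, and the numeric values are hypothetical.

```latex
% Idealized assumption: K independent, uniform hash functions with B buckets each.
\[
  \mathbb{E}[\text{colliding pairs}] = \binom{V}{2}\,\frac{1}{B^{K}},
  \qquad
  \Pr[\text{no collision}] \approx \exp\!\left(-\frac{V(V-1)}{2\,B^{K}}\right).
\]
% With K = 1 and B < V, collisions are forced by the pigeonhole principle.
% Hypothetical values V = 48000, B = 4096, K = 3:
\[
  \binom{48000}{2}\,\frac{1}{4096^{3}}
  = \frac{1\,151\,976\,000}{68\,719\,476\,736}
  \approx 0.017 .
\]
```

Under these assumed values a shared signature is unlikely but not impossible, so uniqueness would still need an explicit check when the signature table is built. The material reviewed here does not say how the paper guarantees it.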

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Multi-hash signatures solve the collision barrier for causal LMs, but the decoder-to-token recovery step still needs real evidence before the performance claims land.

The new piece is the use of multiple independent hash functions to build a short, collision-resistant signature per token, then splitting the model into a Hash Encoder that feeds the transformer and a Hash Decoder that predicts the next signature autoregressively. That setup is what lets them claim hashing works for generation where single-hash approaches failed.

They report consistent gains over standard transformers at 100M, 1B, and 3B scales plus constant-parameter multilingual expansion. If the numbers hold, that would be useful for anyone hitting embedding-table limits.

The soft spot is the inversion step the stress-test flags. The decoder emits discrete hash IDs; any mismatch produces either an invalid signature or the wrong token. The abstract gives no error rates, recovery accuracy, or behavior on unseen hash combinations, so it is not clear whether the reported benchmark wins survive that mapping. Without those numbers or ablations the central claim rests on an untested link.

The work is aimed at people building parameter-efficient LMs for large or multilingual vocabularies. A reader who wants to see whether the hash decoder actually delivers reliable next-token prediction would get value from the full paper. The idea engages the literature on hashing in encoders and tries to extend it, so it is coherent on its own terms.

I would send it to peer review so referees can check the experimental details and the recovery accuracy numbers.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes MultiHashFormer, a generative LM architecture in which each token is represented by a unique multi-hash signature (short sequence of discrete hash IDs from independent hash functions). A Hash Encoder compresses the signature to a latent vector processed by a Transformer decoder; a Hash Decoder then autoregressively predicts the next signature, which is mapped back to a token. The paper reports that this yields consistent outperformance versus standard Transformers at the 100M, 1B and 3B scales on multiple benchmarks, while also enabling multilingual vocabulary expansion at constant parameter count.

Significance. If the reported gains are reproducible and the hash-inversion step proves reliable at scale, the framework would address a long-standing barrier to parameter-efficient hashing in causal LMs and could materially reduce the embedding-matrix cost of large or multilingual vocabularies. The work is distinguished by its explicit focus on autoregressive generation rather than encoder-only settings.

major comments (2)

[Abstract] The central empirical claim that MultiHashFormer 'consistently outperforms standard Transformer LMs across multiple benchmarks' at 100M–3B scales is presented without any metrics, baselines, ablation results, or error bars, so it is impossible to determine whether the reported advantage is load-bearing or an artifact of post-hoc choices.
[Abstract] The claim of 'constant parameter footprint' multilingual expansion rests on the unexamined assumption that the Hash Decoder can produce signatures that invert to unique tokens without ambiguity or systematic error during autoregressive generation; no quantitative bound on signature-prediction error rate, collision rate on OOV hash combinations, or recovery accuracy is supplied, leaving the inversion step as the least-secured link in the argument.
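
The quantities the second comment asks for could be tabulated along the following lines. This is a sketch only: `recovery_report`, its metric names, and the toy predictions are hypothetical, and it assumes predicted and gold signatures are available as equal-length tuples for each generation step.

```python
from collections import Counter


def recovery_report(predictions, gold_signatures, table) -> dict:
    """Summarize signature recovery over a run of generation steps."""
    outcomes = Counter()
    id_hits = 0
    id_total = 0
    for pred, gold in zip(predictions, gold_signatures):
        pred, gold = tuple(pred), tuple(gold)
        id_hits += sum(p == g for p, g in zip(pred, gold))
        id_total += len(gold)
        if pred == gold:
            outcomes["exact"] += 1
        elif pred in table:
            outcomes["wrong_token"] += 1  # valid signature, but another token's
        else:
            outcomes["invalid"] += 1      # signature owned by no token
    steps = sum(outcomes.values())
    if steps == 0 or id_total == 0:
        return {}
    return {
        "per_id_accuracy": id_hits / id_total,
        "exact_rate": outcomes["exact"] / steps,
        "wrong_token_rate": outcomes["wrong_token"] / steps,
        "invalid_rate": outcomes["invalid"] / steps,
    }


# Hand-written toy data, not results from the paper.
toy_table = {(1, 2, 3): "cat", (1, 2, 4): "dog", (5, 6, 7): "fish"}
gold = [(1, 2, 3), (1, 2, 4), (5, 6, 7), (1, 2, 3)]
pred = [(1, 2, 3), (1, 2, 3), (5, 6, 0), (1, 2, 3)]
print(recovery_report(pred, gold, toy_table))
```

Reporting per-ID accuracy next to the exact-match rate matters because a signature is only recovered when every ID is right, so a high per-ID figure can hide a much lower token-level rate.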

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the manuscript to improve clarity and substantiation of the claims.

Referee: [Abstract] The central empirical claim that MultiHashFormer 'consistently outperforms standard Transformer LMs across multiple benchmarks' at 100M–3B scales is presented without any metrics, baselines, ablation results, or error bars, so it is impossible to determine whether the reported advantage is load-bearing or an artifact of post-hoc choices.

Authors: We agree that the abstract would be strengthened by including specific metrics. The full manuscript reports detailed results including perplexity reductions and benchmark scores (e.g., on WikiText, C4, and multilingual tasks) with comparisons to standard Transformers at 100M, 1B, and 3B scales, plus ablations. In revision we will add concise quantitative highlights to the abstract and note the error bars from multiple runs.

revision: yes

Referee: [Abstract] The claim of 'constant parameter footprint' multilingual expansion rests on the unexamined assumption that the Hash Decoder can produce signatures that invert to unique tokens without ambiguity or systematic error during autoregressive generation; no quantitative bound on signature-prediction error rate, collision rate on OOV hash combinations, or recovery accuracy is supplied, leaving the inversion step as the least-secured link in the argument.

Authors: The constant-parameter claim follows directly from replacing the embedding matrix with fixed-size hash functions and a decoder that outputs hash IDs rather than expanding vocabulary embeddings. The manuscript provides empirical evidence of successful multilingual expansion on held-out languages without parameter growth. We acknowledge that explicit quantitative bounds on per-step signature error, OOV collision rates, and end-to-end recovery accuracy are not tabulated in the current version; we will add these analyses (computed from the existing training runs) to the revision.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with no derivation chain

The paper introduces MultiHashFormer as a new framework for hash-based autoregressive LMs, with each token as a multi-hash signature processed by Hash Encoder/Decoder. All central claims (outperformance at 100M-3B scales, constant-footprint multilingual expansion) are presented as results of empirical evaluation on benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The inversion step from predicted hash signature to token is an architectural assumption, not a derivation that reduces to its own inputs by construction. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review based on abstract only; no free parameters, axioms, or independent evidence for new components are described.

invented entities (2)

Hash Encoder (no independent evidence)
purpose: Compresses multi-hash signature into a single latent vector
New architectural component introduced to process hash signatures

Hash Decoder (no independent evidence)
purpose: Generates the hash signature of the next token from the Transformer output
New architectural component required for autoregressive hash-based generation

Reference graph

Works this paper leans on

59 extracted references · 41 canonical work pages · 5 internal anchors

[1]

Phi-4 Technical Report

Marah I Abdin, Jyoti Aneja, Harkirat S. Behl, S \' e bastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, and 8 others. 2024. https://doi.org/10.48550/ARXIV.2412....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.08905 2024
[2]

Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. 2021. https://arxiv.org/abs/2102.03315 Think you have solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge . CoRR, abs/2102.03315

arXiv 2021
[3]

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. https://doi.org/10.1609/AAAI.V34I05.6239 PIQA: reasoning about physical commonsense in natural language . In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The...

work page doi:10.1609/aaai.v34i05.6239 2020
[4]

Trenton Bricken and Cengiz Pehlevan. 2021. https://proceedings.neurips.cc/paper/2021/hash/8171ac2c5544a5cb54ac0f38bf477af4-Abstract.html Attention approximates sparse distributed memory . In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pag...

2021
[5]

Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, and Furu Wei. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.148 Instruction pre-training: Language models are supervised multitask learners . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2529--2550, Miami, Florida, USA. Association for ...

work page doi:10.18653/v1/2024.emnlp-main.148 2024
[6]

Clark, Dan Garrette, Iulia Turc, and John Wieting

Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. 2022. https://doi.org/10.1162/tacl_a_00448 Canine: Pre-training an efficient tokenization-free encoder for language representation . Transactions of the Association for Computational Linguistics, 10:73--91

work page doi:10.1162/tacl_a_00448 2022
[7]

Mirrokni

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. 2004. https://doi.org/10.1145/997817.997857 Locality-sensitive hashing scheme based on p-stable distributions . In Proceedings of the 20th ACM Symposium on Computational Geometry, Brooklyn, New York, USA, June 8-11, 2004 , pages 253--262. ACM

work page doi:10.1145/997817.997857 2004
[8]

Bj \"o rn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, and Samuel Weinbach. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1217 T - FREE : Subword tokenizer-free generative LLM s via sparse representations for memory-efficient embeddings . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pag...

work page doi:10.18653/v1/2024.emnlp-main.1217 2024
[9]

Olmo 3

Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee F. Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, and 48 others. 2025. https://doi.org/10.48550/ARXIV.2512.13961 Olm...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.13961 2025
[10]

Kuzman Ganchev and Mark Dredze. 2008. https://aclanthology.org/W08-0804/ Small statistical models by random feature mixing . In Proceedings of the ACL -08: HLT Workshop on Mobile Language Processing , pages 19--20, Columbus, Ohio. Association for Computational Linguistics

2008
[11]

Christoph Goller and Andreas K \" u chler. 1996. https://doi.org/10.1109/ICNN.1996.548916 Learning task-dependent distributed representations by backpropagation through structure . In Proceedings of International Conference on Neural Networks (ICNN'96), Washington, DC, USA, June 3-6, 1996, pages 347--352. IEEE

work page doi:10.1109/icnn.1996.548916 1996
[12]

Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, and Mingsheng Long. 2024. https://proceedings.mlr.press/v235/guo24e.html On the embedding collapse when scaling up recommendation models . In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 , Proceedings of Machine Learning Research, pages 1...

2024
[13]

Wenhan Han, Yifan Zhang, Zhixun Chen, Binbin Li, Haobin Lin, Bingni Zhang, Taifeng Wang, Mykola Pechenizkiy, Meng Fang, and Yin Zheng. 2025. https://doi.org/10.48550/ARXIV.2506.19468 Mubench: Assessment of multilingual capabilities of large language models across 61 languages . CoRR, abs/2506.19468

work page doi:10.48550/arxiv.2506.19468 2025
[14]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding . In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net

2021
[15]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. http://papers.nips.cc...

2022
[16]

Sukjun Hwang, Brandon Wang, and Albert Gu. 2025. https://doi.org/10.48550/ARXIV.2507.07955 Dynamic chunking for end-to-end hierarchical sequence modeling . CoRR, abs/2507.07955

work page doi:10.48550/arxiv.2507.07955 2025
[17]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L \' e lio Renard Lavaud, Marie - Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth \' e e Lacroix, and William El Sayed. 2023. https://doi.org/...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023
[18]

Taku Kudo and John Richardson. 2018. https://doi.org/10.18653/v1/D18-2012 S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, Brussels, Belgium. Association for Compu...

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
[19]

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. https://doi.org/10.18653/v1/D17-1082 RACE : Large-scale R e A ding comprehension dataset from examinations . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785--794, Copenhagen, Denmark. Association for Computational Linguistics

work page doi:10.18653/v1/d17-1082 2017
[20]

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. https://openreview.net/forum?id=H1eA7AEtvS ALBERT: A lite BERT for self-supervised learning of language representations . In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 . OpenReview.net

2020
[21]

Houyi Li, Wenzhen Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Shuigeng Zhou, Xiangyu Zhang, and Daxin Jiang. 2025. https://doi.org/10.48550/ARXIV.2503.04715 Predictable scale: Part I - optimal hyperparameter scaling law in large language model pretraining . CoRR, abs/2503.04715

work page doi:10.48550/arxiv.2503.04715 2025
[22]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. https://doi.org/10.18653/v1/2022.acl-long.229 T ruthful QA : Measuring how models mimic human falsehoods . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214--3252, Dublin, Ireland. Association for Computational Linguistics

work page doi:10.18653/v1/2022.acl-long.229 2022
[23]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://doi.org/10.18653/v1/D18-1260 Can a suit of armor conduct electricity? a new dataset for open book question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381--2391, Brussels, Belgium. Association for Computational Li...

work page doi:10.18653/v1/d18-1260 2018
[24]

Smith, Edoardo M

Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, and Valentin Hofmann. 2025. https://doi.org/10.48550/ARXIV.2512.15586 Bolmo: Byteifying the next generation of language models . CoRR, abs/2512.15586

work page doi:10.48550/arxiv.2512.15586 2025
[25]

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. https://doi.org/10.18653/v1/N16-1098 A corpus and cloze evaluation for deeper understanding of commonsense stories . In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computation...

work page doi:10.18653/v1/n16-1098 2016
[26]

Nandini Mundra, Aditya Nanda Kishore Khandavally, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, and Mitesh M Khapra. 2024. https://doi.org/10.18653/v1/2024.conll-1.8 An empirical comparison of vocabulary expansion and initialization approaches for language models . In Proceedings of the 28th Conference on Computational Natural Language Learning, pages...

work page doi:10.18653/v1/2024.conll-1.8 2024
[27]

Itay Nakash, Nitay Calderon, Eyal Ben - David, Elad Hoffer, and Roi Reichart. 2025. https://doi.org/10.48550/ARXIV.2503.19693 Adaptivocab: Enhancing LLM efficiency in focused domains through lightweight vocabulary adaptation . CoRR, abs/2503.19693

work page doi:10.48550/arxiv.2503.19693 2025
[28]

Abraham Toluwase Owodunni and Sachin Kumar. 2025. https://doi.org/10.48550/ARXIV.2509.11414 Continually adding new languages to multilingual language models . CoRR, abs/2509.11414

work page doi:10.48550/arxiv.2509.11414 2025
[29]

Artidoro Pagnoni, Ramakanth Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason E Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srini Iyer. 2025. https://doi.org/10.18653/v1/2025.acl-long.453 Byte latent transformer: Patches scale better than tokens . In Proceedings of the 63rd Annual ...

work page doi:10.18653/v1/2025.acl-long.453 2025
[30]

Denis Paperno, Germ \'a n Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern \'a ndez. 2016. https://doi.org/10.18653/v1/P16-1144 The LAMBADA dataset: Word prediction requiring a broad discourse context . In Proceedings of the 54th Annual Meeting of the Association for Computati...

work page doi:10.18653/v1/p16-1144 2016
[31]

Raffel, Leandro von Werra, and Thomas Wolf

Guilherme Penedo, Hynek Kydl \' cek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin A. Raffel, Leandro von Werra, and Thomas Wolf. 2024. http://papers.nips.cc/paper\_files/paper/2024/hash/370df50ccfdf8bde18f8f9c2d9151bda-Abstract-Datasets\_and\_Benchmarks\_Track.html The fineweb datasets: Decanting the web for the finest text data at scale . In...

2024
[32]

Guilherme Penedo, Hynek Kydl \' cek, Vinko Sabolcec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro von Werra, and Thomas Wolf. 2025. https://doi.org/10.48550/ARXIV.2506.20920 Fineweb2: One pipeline to scale them all - adapting pre-training data processing to every language . CoRR, abs/2506.20920

work page doi:10.48550/arxiv.2506.20920 2025
[33]

Mohammad Taher Pilehvar, Dimitri Kartsaklis, Victor Prokhorov, and Nigel Collier. 2018. https://doi.org/10.18653/v1/D18-1169 Card-660: C ambridge rare word dataset - a reliable benchmark for infrequent word representation models . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1391--1401, Brussels, Belgium...

work page doi:10.18653/v1/d18-1169 2018
[34]

Prafull Prakash, Saurabh Kumar Shashidhar, Wenlong Zhao, Subendhu Rongali, Haidar Khan, and Michael Kayser. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.423 Compressing transformer-based semantic parsing models using compositional code embeddings . In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4711--4717, Online....

work page doi:10.18653/v1/2020.findings-emnlp.423 2020
[35]

Ofir Press and Lior Wolf. 2017. https://aclanthology.org/E17-2025/ Using the output embedding to improve language models . In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 157--163, Valencia, Spain. Association for Computational Linguistics

2017
[36]

Fran c ois Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, and Thomas Demeester. 2024. https://doi.org/10.48550/ARXIV.2408.04303 Trans-tokenization and cross-lingual vocabulary transfers: Language adaptation of llms for low-resource NLP . CoRR, abs/2408.04303

work page doi:10.48550/arxiv.2408.04303 2024
[37]

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418 Choice of plausible alternatives: An evaluation of commonsense causal reasoning . In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA...

2011
[38]

Chinnadhurai Sankar, Sujith Ravi, and Zornitsa Kozareva. 2021. https://doi.org/10.18653/v1/2021.eacl-main.246 P ro F ormer: Towards on-device LSH projection based transformers . In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2823--2828, Online. Association for Computationa...

work page doi:10.18653/v1/2021.eacl-main.246 2021
[39]

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. https://doi.org/10.18653/v1/D19-1454 Social IQ a: Commonsense reasoning about social interactions . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),...

work page doi:10.18653/v1/d19-1454 2019
[40]

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. https://doi.org/10.18653/v1/P16-1162 Neural machine translation of rare words with subword units . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715--1725, Berlin, Germany. Association for Computational Linguistics

work page doi:10.18653/v1/p16-1162 2016
[41]

Hajime Senuma. 2025. https://doi.org/10.21105/JOSS.06124 mmh3: A python extension for murmurhash3 . J. Open Source Softw., 10(105):6124

work page doi:10.21105/joss.06124 2025
[42]

Raphael Shu and Hideki Nakayama. 2018. https://openreview.net/forum?id=BJRZzFlRb Compressing word embeddings via deep compositional code learning . In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings . OpenReview.net

2018
[43]

Skarda and Walter J

Christine A. Skarda and Walter J. Freeman. 1987. How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, 10(2):161--173

1987
[44]

Ng, and Christopher D

Richard Socher, Cliff Chiung - Yu Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. https://icml.cc/2011/papers/125\_icmlpaper.pdf Parsing natural scenes and natural language with recursive neural networks . In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011 , pages 129-...

2011
[45]

Dan Svenstrup, Jonas Meinertz Hansen, and Ole Winther. 2017. https://proceedings.neurips.cc/paper/2017/hash/f0f6ba4b5e0000340312d33c212c3ae8-Abstract.html Hash embeddings for efficient word representations . In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Be...

2017
[46]

Qwen Team. 2025. https://doi.org/10.48550/ARXIV.2505.09388 Qwen3 technical report . CoRR, abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
[47]

Atula Tejaswi, Nilesh Gupta, and Eunsol Choi. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.614 Exploring design choices for building language-specific LLM s . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10485--10500, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.findings-emnlp.614 2024
[48]

Ichiro Tsuda. 2001. https://doi.org/10.1017/S0140525X01000097 Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems . Behavioral and Brain Sciences, 24(5):793–810

work page doi:10.1017/s0140525x01000097 2001
[49]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html Attention is all you need . In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Proces...

2017
[50]

Mathurin Videau, Badr Youbi Idrissi, Alessandro Ferreira Leite, Marc Schoenauer, Olivier Teytaud, and David Lopez - Paz. 2025. https://doi.org/10.48550/ARXIV.2506.14761 From bytes to ideas: Language modeling with autoregressive u-nets . CoRR, abs/2506.14761

work page doi:10.48550/arxiv.2506.14761 2025
[51]

and Gardner, Matt

Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. https://doi.org/10.18653/v1/W17-4413 Crowdsourcing multiple choice science questions . In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94--106, Copenhagen, Denmark. Association for Computational Linguistics

work page doi:10.18653/v1/w17-4413 2017
[52]

Huiyin Xue and Nikolaos Aletras. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.536 HashFormers: Towards vocabulary-independent pre-trained transformers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7862–7874, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

[53]

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. https://doi.org/10.1162/tacl_a_00461 ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306

[54]

Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, and Nikolaos Aletras. 2025. https://openreview.net/forum?id=6IdoIKowfe Adapting chat language models using only target unlabeled language data. Trans. Mach. Learn. Res., 2025

[55]

Atsuki Yamaguchi, Aline Villavicencio, and Nikolaos Aletras. 2026. https://doi.org/10.1162/COLI.a.581 How can we effectively expand the vocabulary of LLMs with 0.01 GB of target language text? Computational Linguistics, 52(1):295–330

[56]

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. https://openreview.net/forum?id=HkwZSG-CZ Breaking the softmax bottleneck: A high-rank RNN language model. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net

[57]

Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, and Furu Wei. 2021. https://doi.org/10.18653/v1/2021.findings-acl.40 Adapt-and-distill: Developing small, fast and effective pretrained language models for domains. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 460–470, Online. Association for Computational Linguistics

[58]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. https://doi.org/10.18653/v1/P19-1472 HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics

[59]

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. https://arxiv.org/abs/1810.12885 Record: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885

