MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.
hub Mixed citations
Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing
Mixed citation behavior. Most common role is background (44%).
abstract
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
CircuitFormer is a 511M-parameter encoder-decoder model that generates analog circuit topologies from text prompts at 100% syntactic correctness and 83% functional success using a new subcircuit-mining tokenizer that keeps vocabulary size fixed at 512.
ReTokSync resolves tokenization ambiguity in generative linguistic steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while matching baseline security and quality.
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.
DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
VocabTailor introduces a decoupled dynamic vocabulary selection framework that reduces vocabulary-related memory in SLMs by up to 99% with minimal task performance loss.
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-context quality.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
17th-century Italian imposes a 2.4x surprisal tax on LLMs versus modern Italian with comparable tokenization costs to Russian, yet embeddings stay robust above 0.85 similarity and a temporal prompt reduces surprisal by 60%.
CORE-BREW introduces constant-hit-rate embedding to produce LLRs enabling soft-decision decoding for more robust multi-bit LLM watermarking with two FPR-aware detection modes.
GiLT augments Transformers with semantic dependency graphs by modulating attention to improve syntactic generalization while keeping perplexity competitive and enabling better finetuning on downstream tasks.
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
Later LLM layers align better with human cognitive effort in syntactic ambiguity than early layers do, indicating dual processing modes and complementary benefits from multi-layer probability updates.
citing papers explorer
-
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.