Byte pair encoding is suboptimal for language model pretraining

Kaj Bostrom, Greg Durrett · 2020 · DOI 10.18653/v1/2020.findings-emnlp.414

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.

LangMAP: A Language-Adaptive Approach to Tokenization

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.

Tokenization with Split Trees

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.

Rubato: Transcribing Piano Music with Timestamps

cs.SD · 2026-05-22 · unverdicted · novelty 6.0

Rubato model with InterMo representation outperforms cascade methods in generating timestamped piano sheet music from audio, even when cascades receive ground-truth MIDI.

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

eess.SP · 2026-05-16 · unverdicted · novelty 6.0

Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.

BloombergGPT: A Large Language Model for Finance

cs.LG · 2023-03-30 · conditional · novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

Toten: A Knowledge-Based System For Structure-Preserving Representation Of Physical Quantities And Technical Notation In Brazilian Portuguese

cs.AI · 2026-06-17 · unverdicted · novelty 5.0

TOTEN is a knowledge-based system for structure-preserving representation of physical quantities and technical notation in Brazilian Portuguese using an ontology of engineering entities and external authorities, outperforming statistical baselines in atomicity and reconstruction.

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

cs.CL · 2026-05-19 · unverdicted · novelty 3.0

A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.

citing papers explorer

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

cs.CL · 2026-06-25 · unverdicted · none · ref 6

LangMAP: A Language-Adaptive Approach to Tokenization

cs.CL · 2026-06-22 · unverdicted · none · ref 35

Tokenization with Split Trees

cs.CL · 2026-05-21 · unverdicted · none · ref 35

Rubato: Transcribing Piano Music with Timestamps

cs.SD · 2026-05-22 · unverdicted · none · ref 28

Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

eess.SP · 2026-05-16 · unverdicted · none · ref 166

Toten: A Knowledge-Based System For Structure-Preserving Representation Of Physical Quantities And Technical Notation In Brazilian Portuguese

cs.AI · 2026-06-17 · unverdicted · none · ref 16

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

cs.CL · 2026-05-19 · unverdicted · none · ref 8
