MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.
Byte pair encoding is suboptimal for language model pretraining
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.
ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.
Rubato model with InterMo representation outperforms cascade methods in generating timestamped piano sheet music from audio, even when cascades receive ground-truth MIDI.
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
TOTEN is a knowledge-based system for structure-preserving representation of physical quantities and technical notation in Brazilian Portuguese using an ontology of engineering entities and external authorities, outperforming statistical baselines in atomicity and reconstruction.
A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.
citing papers explorer
-
MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment
MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.
-
LangMAP: A Language-Adaptive Approach to Tokenization
LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.
-
Tokenization with Split Trees
ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.
-
Rubato: Transcribing Piano Music with Timestamps
Rubato model with InterMo representation outperforms cascade methods in generating timestamped piano sheet music from audio, even when cascades receive ground-truth MIDI.
-
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
-
Toten: A Knowledge-Based System For Structure-Preserving Representation Of Physical Quantities And Technical Notation In Brazilian Portuguese
TOTEN is a knowledge-based system for structure-preserving representation of physical quantities and technical notation in Brazilian Portuguese using an ontology of engineering entities and external authorities, outperforming statistical baselines in atomicity and reconstruction.
-
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.