Byte pair encoding is suboptimal for language model pretraining

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

years: 2026 (7), 2023 (1)

roles

method 1

polarities

use method 1

representative citing papers

LangMAP: A Language-Adaptive Approach to Tokenization

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.

Tokenization with Split Trees

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.

Rubato: Transcribing Piano Music with Timestamps

cs.SD · 2026-05-22 · unverdicted · novelty 6.0

Rubato model with InterMo representation outperforms cascade methods in generating timestamped piano sheet music from audio, even when cascades receive ground-truth MIDI.

BloombergGPT: A Large Language Model for Finance

cs.LG · 2023-03-30 · conditional · novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 12
