hub Mixed citations

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo, John Richardson · 2018 · cs.CL · DOI 10.18653/v1/d18-2012 · arXiv 1808.06226

Mixed citation behavior. Most common role is background (44%).

54 Pith papers citing it

1,189 external citations · Crossref

Background 44% of classified citations

open full Pith review browse 54 citing papers arXiv PDF

abstract

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.

hub tools

JSON dossier citing papers JSON publisher DOI arXiv source

citation-role summary

background 4 method 3 dataset 1 other 1

citation-polarity summary

background 4 use method 3 unclear 1 use dataset 1

representative citing papers

MultiHashFormer: Hash-based Generative Language Models

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

CircuitFormer is a 511M-parameter encoder-decoder model that generates analog circuit topologies from text prompts at 100% syntactic correctness and 83% functional success using a new subcircuit-mining tokenizer that keeps vocabulary size fixed at 512.

ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography

cs.CR · 2026-04-28 · unverdicted · novelty 7.0

ReTokSync resolves tokenization ambiguity in generative linguistic steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while matching baseline security and quality.

How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.

Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

cs.CL · 2025-08-21 · unverdicted · novelty 7.0

VocabTailor introduces a decoupled dynamic vocabulary selection framework that reduces vocabulary-related memory in SLMs by up to 99% with minimal task performance loss.

Moshi: a speech-text foundation model for real-time dialogue

eess.AS · 2024-09-17 · accept · novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

Extending Context Window of Large Language Models via Positional Interpolation

cs.CL · 2023-06-27 · conditional · novelty 7.0

Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-context quality.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Improving language models by retrieving from trillions of tokens

cs.CL · 2021-12-08 · unverdicted · novelty 7.0

RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.

The Power of Scale for Parameter-Efficient Prompt Tuning

cs.CL · 2021-04-18 · unverdicted · novelty 7.0

Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.

Rethinking Attention with Performers

cs.LG · 2020-09-30 · unverdicted · novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

cs.CL · 2019-09-26 · accept · novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

cs.CL · 2019-06-19 · accept · novelty 7.0

XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.

How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

17th-century Italian imposes a 2.4x surprisal tax on LLMs versus modern Italian with comparable tokenization costs to Russian, yet embeddings stay robust above 0.85 similarity and a temporal prompt reduces surprisal by 60%.

CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking

cs.CR · 2026-06-23 · unverdicted · novelty 6.0

CORE-BREW introduces constant-hit-rate embedding to produce LLRs enabling soft-decision decoding for more robust multi-bit LLM watermarking with two FPR-aware detection modes.

GiLT: Augmenting Transformer Language Models with Dependency Graphs

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

GiLT augments Transformers with semantic dependency graphs by modulating attention to improve syntactic generalization while keeping perplexity competitive and enabling better finetuning on downstream tasks.

Predicting Large Model Test Losses with a Noisy Quadratic System

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.

Dual Alignment Between Language Model Layers and Human Sentence Processing

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

Later LLM layers align better with human cognitive effort in syntactic ambiguity than early layers do, indicating dual processing modes and complementary benefits from multi-layer probability updates.

citing papers explorer

Showing 50 of 54 citing papers.

MultiHashFormer: Hash-based Generative Language Models cs.CL · 2026-06-26 · unverdicted · none · ref 18 · internal anchor
MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.
MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment cs.CL · 2026-06-25 · unverdicted · none · ref 16 · internal anchor
MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models cs.LG · 2026-05-11 · unverdicted · none · ref 36 · internal anchor
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt cs.AI · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
CircuitFormer is a 511M-parameter encoder-decoder model that generates analog circuit topologies from text prompts at 100% syntactic correctness and 83% functional success using a new subcircuit-mining tokenizer that keeps vocabulary size fixed at 512.
ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography cs.CR · 2026-04-28 · unverdicted · none · ref 11 · internal anchor
ReTokSync resolves tokenization ambiguity in generative linguistic steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while matching baseline security and quality.
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them cs.CL · 2026-04-18 · unverdicted · none · ref 32 · internal anchor
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts cs.CL · 2026-04-13 · unverdicted · none · ref 12 · internal anchor
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation cs.CV · 2026-03-03 · unverdicted · none · ref 12 · internal anchor
DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs cs.CL · 2025-12-18 · unverdicted · none · ref 53 · internal anchor
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models cs.CL · 2025-08-21 · unverdicted · none · ref 10 · internal anchor
VocabTailor introduces a decoupled dynamic vocabulary selection framework that reduces vocabulary-related memory in SLMs by up to 99% with minimal task performance loss.
Moshi: a speech-text foundation model for real-time dialogue eess.AS · 2024-09-17 · accept · none · ref 50 · internal anchor
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Extending Context Window of Large Language Models via Positional Interpolation cs.CL · 2023-06-27 · conditional · none · ref 11 · internal anchor
Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-context quality.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 245 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Improving language models by retrieving from trillions of tokens cs.CL · 2021-12-08 · unverdicted · none · ref 119 · internal anchor
RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
The Power of Scale for Parameter-Efficient Prompt Tuning cs.CL · 2021-04-18 · unverdicted · none · ref 83 · internal anchor
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
Rethinking Attention with Performers cs.LG · 2020-09-30 · unverdicted · none · ref 135 · internal anchor
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 41 · internal anchor
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations cs.CL · 2019-09-26 · accept · none · ref 21 · internal anchor
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
XLNet: Generalized Autoregressive Pretraining for Language Understanding cs.CL · 2019-06-19 · accept · none · ref 17 · internal anchor
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation cs.CL · 2026-06-25 · unverdicted · none · ref 6 · internal anchor
17th-century Italian imposes a 2.4x surprisal tax on LLMs versus modern Italian with comparable tokenization costs to Russian, yet embeddings stay robust above 0.85 similarity and a temporal prompt reduces surprisal by 60%.
CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking cs.CR · 2026-06-23 · unverdicted · none · ref 13 · internal anchor
CORE-BREW introduces constant-hit-rate embedding to produce LLRs enabling soft-decision decoding for more robust multi-bit LLM watermarking with two FPR-aware detection modes.
GiLT: Augmenting Transformer Language Models with Dependency Graphs cs.CL · 2026-05-15 · unverdicted · none · ref 30 · internal anchor
GiLT augments Transformers with semantic dependency graphs by modulating attention to improve syntactic generalization while keeping perplexity competitive and enabling better finetuning on downstream tasks.
Predicting Large Model Test Losses with a Noisy Quadratic System cs.LG · 2026-05-09 · unverdicted · none · ref 30 · internal anchor
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
Dual Alignment Between Language Model Layers and Human Sentence Processing cs.CL · 2026-04-20 · unverdicted · none · ref 61 · internal anchor
Later LLM layers align better with human cognitive effort in syntactic ambiguity than early layers do, indicating dual processing modes and complementary benefits from multi-layer probability updates.
Accelerating Vision Transformers with Adaptive Patch Sizes cs.CV · 2025-10-20 · conditional · none · ref 11 · internal anchor
APT adaptively varies patch sizes within a single image to reduce ViT token count, delivering 40-50% throughput gains on large models with no downstream performance loss.
Multi Language Models for On-the-Fly Syntax Highlighting cs.SE · 2025-10-05 · unverdicted · none · ref 16 · internal anchor
Unified multi-language deep learning model for on-the-fly syntax highlighting using normalization and few-shot learning to support six languages with lower deployment cost.
How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP cs.CL · 2024-11-08 · unverdicted · none · ref 30 · internal anchor
The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
An Empirical Study of Mamba-based Language Models cs.LG · 2024-06-12 · accept · none · ref 28 · internal anchor
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
Chameleon: Mixed-Modal Early-Fusion Foundation Models cs.CL · 2024-05-16 · unverdicted · none · ref 16 · internal anchor
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.
Gemini: A Family of Highly Capable Multimodal Models cs.CL · 2023-12-19 · conditional · none · ref 47 · internal anchor
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 55 · internal anchor
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 257 · internal anchor
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation cs.CV · 2022-06-22 · unverdicted · none · ref 33 · internal anchor
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 74 · internal anchor
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
ST-MoE: Designing Stable and Transferable Sparse Expert Models cs.CL · 2022-02-17 · unverdicted · none · ref 163 · internal anchor
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation cs.CL · 2021-09-02 · conditional · none · ref 71 · internal anchor
CodeT5 adds identifier-aware pre-training and bimodal dual generation to a T5-style encoder-decoder, yielding better results on defect detection, clone detection, and code-to-text, text-to-code, and code-to-code tasks than prior encoder-only or decoder-only models.
Program Synthesis with Large Language Models cs.PL · 2021-08-16 · unverdicted · none · ref 4 · internal anchor
Large language models synthesize Python code from descriptions with log-linear scaling in performance, reaching 59.6% on MBPP via few-shot prompting and 83.8% on MathQA-Python after fine-tuning, while human feedback halves error rates but models fail at predicting program outputs.
Toten: A Knowledge-Based System For Structure-Preserving Representation Of Physical Quantities And Technical Notation In Brazilian Portuguese cs.AI · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
TOTEN is a knowledge-based system for structure-preserving representation of physical quantities and technical notation in Brazilian Portuguese using an ontology of engineering entities and external authorities, outperforming statistical baselines in atomicity and reconstruction.
Budgeted Dynamic Trace Structures for Token-Efficient Sequential Computation cs.DC · 2026-05-20 · unverdicted · none · ref 13 · internal anchor
BDTS is a new data-structural framework for budgeted maintenance of rooted trace graphs, with Rust benchmarks showing compaction of 350k-2.71M tokens to 1k-4k tokens and model input reduction from ~3360 to ~432 tokens.
The Impact of Vocabulary Overlaps on Knowledge Transfer in Multilingual Machine Translation cs.CL · 2026-05-05 · unverdicted · none · ref 15 · internal anchor
Experiments show domain match and language relatedness drive knowledge transfer in multilingual MT more than vocabulary overlap.
Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective cs.CR · 2026-04-20 · unverdicted · none · ref 43 · internal anchor
BPE tokenization creates gibberish bias in CLLMs, causing secrets with high character entropy but low token entropy to be preferentially memorized due to training data distribution shifts.
Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings cs.SI · 2026-04-08 · unverdicted · none · ref 19 · internal anchor
LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices cs.CV · 2023-12-28 · unverdicted · none · ref 63 · internal anchor
MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 261 · internal anchor
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
Continuous diffusion for categorical data cs.CL · 2022-11-28 · unverdicted · none · ref 52 · internal anchor
The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
LLM-Safety Evaluations Lack Robustness cs.CR · 2025-03-04 · unverdicted · none · ref 35 · internal anchor
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
PaliGemma: A versatile 3B VLM for transfer cs.CV · 2024-07-10 · unverdicted · none · ref 68 · internal anchor
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 100 · internal anchor
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 42 · internal anchor
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
Baichuan 2: Open Large-scale Language Models cs.CL · 2023-09-19 · unverdicted · none · ref 39 · internal anchor
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer