SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

John Richardson; Taku Kudo

arxiv: 1808.06226 · v1 · submitted 2018-08-19 · 💻 cs.CL

Taku Kudo, John Richardson

Pith reviewed 2026-05-12 20:02 UTC · model grok-4.3

classification 💻 cs.CL

keywords subword tokenization · neural machine translation · language independent · end-to-end processing · SentencePiece

The pith

SentencePiece trains subword models directly from raw sentences without pre-tokenization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SentencePiece is a subword tokenizer and detokenizer that learns its segmentation models straight from raw sentences rather than requiring pre-tokenized word sequences. This removes the need for language-specific word segmentation tools and supports fully end-to-end neural text processing pipelines such as machine translation. The authors demonstrate on an English-Japanese translation task that the approach reaches accuracy levels comparable to standard subword training methods that start from tokenized input. Open-source C++ and Python implementations are provided to allow direct use in neural models.

Core claim

SentencePiece trains subword segmentation models directly from raw sentences, enabling purely end-to-end and language-independent neural text processing systems while maintaining comparable performance to pre-tokenized methods.

What carries the argument

The SentencePiece trainer, which builds subword units by processing raw sentence data without any prior word-level tokenization step.
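
The idea of training on raw sentences can be illustrated with a toy byte-pair-encoding (BPE) learner. This is a minimal sketch, not SentencePiece's implementation: it assumes whitespace is mapped to the visible marker "▁" (U+2581) and then treated like any other character, so no word tokenizer is needed and the original text can be restored by reversing that mapping. The function names `to_symbols`, `merge_pair`, `train_bpe`, `encode`, and `decode` are hypothetical, and the sketch also assumes the input does not already contain "▁".

```python
from collections import Counter

SPACE = "\u2581"  # "▁", used here as a visible stand-in for whitespace


def to_symbols(sentence):
    """Split a raw sentence into single characters, keeping spaces as a symbol."""
    return list(sentence.replace(" ", SPACE))


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules directly from raw sentences."""
    corpus = [to_symbols(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for symbols in corpus:
            counts.update(zip(symbols, symbols[1:]))
        if not counts:
            break  # nothing left to merge
        # Pick the most frequent pair; break ties lexicographically.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def encode(sentence, merges):
    """Apply the learned merges, in order, to a new raw sentence."""
    symbols = to_symbols(sentence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def decode(pieces):
    """Detokenize: concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(SPACE, " ")
```

Calling `train_bpe(["hello world", "hello there world"], 10)` and then `decode(encode("hello world", merges))` returns the input string unchanged, because the space marker survives segmentation. In this toy version merges are free to cross word boundaries; the real tool exposes its own options for that behavior, which this sketch does not model.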

Load-bearing premise

That the comparable accuracy seen in one English-Japanese neural machine translation experiment will hold for other language pairs, tasks, and model architectures.

What would settle it

An experiment on a different language pair such as English-Chinese that shows substantially lower translation quality when using SentencePiece compared with pre-tokenized subword baselines.

Original abstract

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SentencePiece gives you subword training and segmentation straight from raw sentences with released code and a basic NMT check.

The main thing here is that SentencePiece trains subword models directly on raw sentences instead of requiring pre-tokenized word sequences first. That removes a common language-specific preprocessing step and supports more end-to-end setups, especially for multilingual work. The paper backs this with open-source C++ and Python code under Apache 2, plus a straightforward English-Japanese NMT experiment that reports BLEU scores comparable to the usual approach. They also include some comparisons across different training and segmentation settings. The implementation looks practical, and the code release lets others verify the claims directly. The experiment is narrow, though: it is limited to one language pair and one task, so it leaves open how well the results hold for other languages, directions, or tasks like language modeling. This is not a theoretical advance but an engineering convenience that fills a real gap in the tooling. It is aimed at people building neural text pipelines who want to skip extra preprocessing layers. Practitioners working on NMT or similar systems would get immediate value from trying the GitHub repo. I would send it for peer review because the code is there to test and the core claim is concrete and falsifiable.

Referee Report

0 major / 2 minor

Summary. The paper introduces SentencePiece, a language-independent subword tokenizer and detokenizer for neural text processing including NMT. It trains and segments subword units directly from raw sentences without pre-tokenization, provides open-source C++ and Python implementations under the Apache 2 license, and validates the approach via an English-Japanese NMT experiment showing comparable accuracy along with comparisons of various subword training configurations.

Significance. If the result holds, the work supplies a practical tool that enables purely end-to-end neural pipelines without language-specific preprocessing steps. The release of working open-source code together with a concrete NMT experiment that reports comparable BLEU scores is a clear strength supporting reproducibility and adoption in the community.

minor comments (2)

Abstract: the statement that the method achieves 'comparable accuracy to direct subword training from raw sentences' is slightly ambiguous as to the exact baselines and metrics; specifying the BLEU scores and comparison methods would make the summary more self-contained.
Experimental section: while the single English-Japanese NMT validation supports the feasibility claim, the manuscript would benefit from a table summarizing the various configurations tested and their performance differences for easier reference.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, the recognition of the tool's practical value for end-to-end neural pipelines, and the recommendation to accept. We are pleased that the open-source release and reproducibility via the NMT experiment were noted as strengths.

Circularity Check

0 steps flagged

No significant circularity

The paper describes an engineering implementation of SentencePiece for direct subword training and segmentation on raw sentences, with a single empirical NMT validation on English-Japanese showing comparable accuracy to baselines. No derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations exist; the central feasibility claim is grounded in the reported experiment and open-source code rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software tool paper describing an implementation of existing subword algorithms (BPE and unigram LM). No new mathematical axioms, free parameters, or invented entities are introduced beyond standard algorithmic choices already present in the cited literature.
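
For the unigram LM side, segmentation amounts to choosing the split of a string whose pieces have the highest total log-probability, which a Viterbi-style dynamic program can find. The following is a hedged sketch of that search only: the vocabulary and log-probabilities in `TOY_LOG_PROBS` are made up for illustration, the name `viterbi_segment` is hypothetical, and the training procedure that would estimate such probabilities is not shown.

```python
import math

# Hypothetical piece log-probabilities, invented for this example.
TOY_LOG_PROBS = {
    "\u2581hello": -2.0,
    "\u2581world": -2.5,
    "\u2581": -4.0,
    "h": -6.0, "e": -6.0, "l": -6.0, "o": -6.0,
    "w": -6.0, "r": -6.0, "d": -6.0,
}


def viterbi_segment(text, log_probs):
    """Return (pieces, score) for the best segmentation under a unigram model."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best score of any split of text[:i]
    back = [0] * (n + 1)                # start index of the last piece in that split
    best_score[0] = 0.0
    max_len = max(len(piece) for piece in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in log_probs:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best_score[n]
```

With the toy table above, `viterbi_segment("▁hello▁world", TOY_LOG_PROBS)` returns `(['▁hello', '▁world'], -4.5)`, since the two whole-word pieces score far better than any character-by-character split.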

Forward citations

Cited by 48 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
cs.LG 2026-05 unverdicted novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt
cs.AI 2026-05 unverdicted novelty 7.0

CircuitFormer is a 511M-parameter encoder-decoder model that generates analog circuit topologies from text prompts at 100% syntactic correctness and 83% functional success using a new subcircuit-mining tokenizer that ...
ReTokSync: Self-Synchronizing Tokenization Disambiguation for Generative Linguistic Steganography
cs.CR 2026-04 unverdicted novelty 7.0

ReTokSync resolves tokenization ambiguity in generative linguistic steganography via targeted self-synchronizing resets, achieving over 99.7% extraction accuracy and 100% recovery with an auxiliary channel while match...
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
cs.CL 2026-04 unverdicted novelty 7.0

Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
cs.CL 2026-04 unverdicted novelty 7.0

MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...
Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
cs.CV 2026-03 unverdicted novelty 7.0

DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on understanding and generation benchmarks.
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
cs.CL 2025-12 unverdicted novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models
cs.CL 2025-08 unverdicted novelty 7.0

VocabTailor introduces a decoupled dynamic vocabulary selection framework that reduces vocabulary-related memory in SLMs by up to 99% with minimal task performance loss.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Extending Context Window of Large Language Models via Positional Interpolation
cs.CL 2023-06 conditional novelty 7.0

Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Improving language models by retrieving from trillions of tokens
cs.CL 2021-12 unverdicted novelty 7.0

RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
The Power of Scale for Parameter-Efficient Prompt Tuning
cs.CL 2021-04 unverdicted novelty 7.0

Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
Rethinking Attention with Performers
cs.LG 2020-09 unverdicted novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
cs.CL 2019-09 accept novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
cs.CL 2019-06 accept novelty 7.0

XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
GiLT: Augmenting Transformer Language Models with Dependency Graphs
cs.CL 2026-05 unverdicted novelty 6.0

GiLT augments Transformers with semantic dependency graphs by modulating attention to improve syntactic generalization while keeping perplexity competitive and enabling better finetuning on downstream tasks.
Predicting Large Model Test Losses with a Noisy Quadratic System
cs.LG 2026-05 unverdicted novelty 6.0

A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
Dual Alignment Between Language Model Layers and Human Sentence Processing
cs.CL 2026-04 unverdicted novelty 6.0

Later LLM layers align better with human cognitive effort in syntactic ambiguity than early layers do, indicating dual processing modes and complementary benefits from multi-layer probability updates.
Accelerating Vision Transformers with Adaptive Patch Sizes
cs.CV 2025-10 conditional novelty 6.0

APT adaptively varies patch sizes within a single image to reduce ViT token count, delivering 40-50% throughput gains on large models with no downstream performance loss.
Multi Language Models for On-the-Fly Syntax Highlighting
cs.SE 2025-10 unverdicted novelty 6.0

Unified multi-language deep learning model for on-the-fly syntax highlighting using normalization and few-shot learning to support six languages with lower deployment cost.
How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
cs.CL 2024-11 unverdicted novelty 6.0

The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
An Empirical Study of Mamba-based Language Models
cs.LG 2024-06 accept novelty 6.0

An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
Chameleon: Mixed-Modal Early-Fusion Foundation Models
cs.CL 2024-05 unverdicted novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
Gemini: A Family of Highly Capable Multimodal Models
cs.CL 2023-12 conditional novelty 6.0

Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
cs.CL 2022-11 unverdicted novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
cs.CV 2022-06 unverdicted novelty 6.0

Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
cs.CL 2021-09 conditional novelty 6.0

CodeT5 adds identifier-aware pre-training and bimodal dual generation to a T5-style encoder-decoder, yielding better results on defect detection, clone detection, and code-to-text, text-to-code, and code-to-code tasks...
Program Synthesis with Large Language Models
cs.PL 2021-08 unverdicted novelty 6.0

Large language models synthesize Python code from descriptions with log-linear scaling in performance, reaching 59.6% on MBPP via few-shot prompting and 83.8% on MathQA-Python after fine-tuning, while human feedback h...
Budgeted Dynamic Trace Structures for Token-Efficient Sequential Computation
cs.DC 2026-05 unverdicted novelty 5.0

BDTS is a new data-structural framework for budgeted maintenance of rooted trace graphs, with Rust benchmarks showing compaction of 350k-2.71M tokens to 1k-4k tokens and model input reduction from ~3360 to ~432 tokens.
The Impact of Vocabulary Overlaps on Knowledge Transfer in Multilingual Machine Translation
cs.CL 2026-05 unverdicted novelty 5.0

Experiments show domain match and language relatedness drive knowledge transfer in multilingual MT more than vocabulary overlap.
Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective
cs.CR 2026-04 unverdicted novelty 5.0

BPE tokenization creates gibberish bias in CLLMs, causing secrets with high character entropy but low token entropy to be preferentially memorized due to training data distribution shifts.
Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings
cs.SI 2026-04 unverdicted novelty 5.0

LLMs handle skin tone emoji modifiers better than dedicated embedding models but display systemic disparities in sentiment and semantic consistency across tones.
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
cs.CV 2023-12 unverdicted novelty 5.0

MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetso...
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
Continuous diffusion for categorical data
cs.CL 2022-11 unverdicted novelty 5.0

The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
LLM-Safety Evaluations Lack Robustness
cs.CR 2025-03 unverdicted novelty 4.0

LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
PaliGemma: A versatile 3B VLM for transfer
cs.CV 2024-07 unverdicted novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
Gemma: Open Models Based on Gemini Research and Technology
cs.CL 2024-03 accept novelty 4.0

Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
Yi: Open Foundation Models by 01.AI
cs.CL 2024-03 unverdicted novelty 4.0

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
Baichuan 2: Open Large-scale Language Models
cs.CL 2023-09 unverdicted novelty 4.0

Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
Gemma 2: Improving Open Language Models at a Practical Size
cs.CL 2024-07 conditional novelty 3.0

Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways
cs.CL 2023-10 unverdicted novelty 3.0

Lil-Bevo applies music pretraining, curriculum learning on sequence length, and targeted masking to small LMs in the BabyLM challenge, finding modest gains from short sequences but overall limited performance.
Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction
cs.CL 2019-07 unverdicted novelty 3.0

A Spanish Twitter language model trained from scratch with label smoothing placed 3rd and 2nd in the HAHA 2019 humor classification and regression tasks.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 48 Pith papers · 3 internal anchors

[1] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473

[3] Michael Denkowski and Graham Neubig. 2017. Stronger baselines for trustable results in neural machine translation. In Proc. of Workshop on Neural Machine Translation

[4] Melvin Johnson, Mike Schuster, et al. 2016. Google's multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558

[5] Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proc. of ACL

[6] Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043

[7] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP

[8] Toshiaki Nakazawa, Shohei Higashiyama, et al. 2017. Overview of the 4th Workshop on Asian Translation. In Proceedings of the 4th Workshop on Asian Translation (WAT2017)

[9] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL

[10] Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771

[11] Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proc. of EMNLP

[12] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. of ACL

[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762

[14] Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. In ICML Deep Learning Workshop

[15] Yonghui Wu, Mike Schuster, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144
