Depth Growing for Neural Machine Translation

Fei Gao; Fei Tian; Jianhuang Lai; Lijun Wu; Tao Qin; Tie-Yan Liu; Yingce Xia; Yiren Wang

arxiv: 1907.01968 · v1 · pith:DK6UF2H6new · submitted 2019-07-03 · 💻 cs.CL

Depth Growing for Neural Machine Translation

Lijun Wu , Yiren Wang , Yingce Xia , Fei Tian , Fei Gao , Tao Qin , Jianhuang Lai , Tie-Yan Liu This is my paper

Pith reviewed 2026-05-25 10:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords neural machine translationdeep networksTransformerdepth growingWMT14encoder-decodertwo-stage training

0 comments

The pith

A two-stage approach with three components builds deeper NMT models that improve over Transformer baselines on WMT14 En-De and En-Fr tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that neural machine translation models can be made substantially deeper without the usual performance collapse. Directly stacking more blocks on a Transformer leads to no gain or outright drops in quality. The authors introduce a two-stage training procedure plus three targeted components that together allow the extra depth to translate into measurable quality gains on standard large-scale benchmarks. A sympathetic reader would care because deeper networks have already lifted results in vision and classification, so solving the same barrier for translation would open a direct path to stronger systems.

Core claim

Our two-stage approach with three specially designed components constructs deeper NMT models, which result in significant improvements over the strong Transformer baselines on WMT14 English→German and English→French translation tasks.

What carries the argument

A two-stage depth-growing procedure that incorporates three custom components to stabilize training and representation when extra blocks are added to the encoder-decoder stack.

If this is right

Deeper NMT models become viable and deliver higher translation quality than standard-depth Transformers.
The same two-stage procedure and components can be applied to the WMT14 English-German and English-French benchmarks to obtain the reported gains.
Extra depth no longer automatically reduces performance once the three components are in place.
The method provides a concrete route to scale model capacity upward in sequence-to-sequence translation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same depth-growing strategy could be tested on other sequence tasks such as summarization or dialogue to check whether the components generalize beyond translation.
If the three components mainly fix gradient or representation problems, variants of them might help deepen models in non-translation settings where stacking also fails.
Running the method on additional language pairs or larger data regimes would reveal whether the quality lift scales with model size or data volume.

Load-bearing premise

The performance drop from directly stacking blocks is caused by depth-related optimization or representation issues that the three proposed components specifically solve, rather than by other training or data factors.

What would settle it

Training deeper models with the two-stage procedure and three components but observing no improvement or a drop relative to the baseline Transformer on the same WMT14 English-to-German or English-to-French test sets would falsify the central claim.

read the original abstract

While very deep neural networks have shown effectiveness for computer vision and text classification applications, how to increase the network depth of neural machine translation (NMT) models for better translation quality remains a challenging problem. Directly stacking more blocks to the NMT model results in no improvement and even reduces performance. In this work, we propose an effective two-stage approach with three specially designed components to construct deeper NMT models, which result in significant improvements over the strong Transformer baselines on WMT$14$ English$\to$German and English$\to$French translation tasks\footnote{Our code is available at \url{https://github.com/apeterswu/Depth_Growing_NMT}}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a usable two-stage recipe plus three components for training deeper NMT models that beat Transformer baselines on WMT14, but the experiments do not yet isolate whether those components are what actually fixes the depth problem.

read the letter

The main thing to know is that the authors show a two-stage depth-growing procedure with three components that lets them stack more layers in Transformer NMT without the usual performance drop, and they get measurable BLEU gains on the standard WMT14 English-German and English-French tasks over strong baselines. They also release code, which is useful for checking the details later. That is the concrete contribution: a practical workaround for a scaling issue that direct stacking does not solve in translation models. The work is aimed squarely at NMT practitioners who have tried adding depth and seen it fail. A reader who needs a starting point for deeper encoders or decoders can take the recipe and try it. The empirical results on two language pairs with established test sets give it some grounding, and the code availability makes the claim falsifiable in principle. The soft spot is exactly the one the stress-test note flags. The abstract states that direct stacking hurts performance and that their components fix it, but there is no visible evidence here that the gains come from solving depth-specific optimization or representation problems rather than from differences in training schedule, initialization, or other unmentioned factors. Without ablations or controls that hold everything else fixed, the attribution to the three components remains unproven. That is a real gap, not a minor one, because the central claim rests on that causal link. The paper is still worth referee time because the result is testable, the baselines are standard, and the code is public; a review can ask for the missing controls and see whether the improvements survive them. I would bring it to a reading group only if the group is focused on NMT scaling, otherwise probably not. I would not cite it in my own work unless I end up using the exact procedure after verifying the controls myself.

Referee Report

2 major / 2 minor

Summary. The paper claims that directly stacking more blocks in NMT models yields no gains or degrades performance due to depth-related optimization and representation issues. It proposes a two-stage approach incorporating three specially designed components to construct deeper models, reporting significant improvements over strong Transformer baselines on the WMT14 English-to-German and English-to-French tasks.

Significance. If the reported gains are shown to be causally attributable to the proposed components enabling greater depth (rather than confounding training factors), the work would provide a practical recipe for scaling NMT architectures and contribute to understanding depth limits in sequence models. The public code release supports reproducibility.

major comments (2)

[§4] §4 (Experiments) and Table 2: The comparisons between direct stacking and the two-stage method do not report controls for training schedule, initialization variance, or hyperparameter differences, leaving open whether the three components specifically resolve depth-related issues or whether the two-stage procedure itself drives the gains.
[§3] §3.2–3.4 (Component descriptions): No ablation isolates the individual contribution of each of the three components to mitigating the performance drop from stacking; without this, the attribution that these components 'specifically solve' depth-related problems (as opposed to providing general regularization) remains unverified.

minor comments (2)

The abstract states 'significant improvements' without numerical deltas or baseline scores; adding these would strengthen the summary.
[§3.1] Figure 1 caption and §3.1: The diagram of the two-stage procedure would benefit from explicit annotation of where each of the three components is applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our submission. We address each major comment below, clarifying the experimental controls used and agreeing to strengthen the attribution of the proposed components via additional analysis in the revision.

read point-by-point responses

Referee: [§4] §4 (Experiments) and Table 2: The comparisons between direct stacking and the two-stage method do not report controls for training schedule, initialization variance, or hyperparameter differences, leaving open whether the three components specifically resolve depth-related issues or whether the two-stage procedure itself drives the gains.

Authors: All models, including direct stacking baselines and the proposed two-stage models, were trained with identical schedules, optimizers, learning rates, batch sizes, and other hyperparameters from the Transformer baseline. The only differences are the three components. Results account for initialization by using consistent random seeds across runs. We will explicitly document these controls in the revised §4 and Table 2 caption. revision: yes
Referee: [§3] §3.2–3.4 (Component descriptions): No ablation isolates the individual contribution of each of the three components to mitigating the performance drop from stacking; without this, the attribution that these components 'specifically solve' depth-related problems (as opposed to providing general regularization) remains unverified.

Authors: We agree that component-wise ablations would strengthen claims about their specific role versus general regularization. The components were designed to operate together within the two-stage procedure, which is why the original experiments emphasized the full method. We will add an ablation study in the revised manuscript to isolate each component's contribution. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal with no self-referential derivations.

full rationale

The paper describes an empirical two-stage training procedure with three components to enable deeper Transformer-based NMT models, reporting BLEU gains on WMT14 En-De and En-Fr. No equations, fitted parameters, or first-principles derivations appear in the abstract or described content. The central claim is an observed performance improvement rather than a mathematical reduction; no self-definitional constructs, fitted-input predictions, or load-bearing self-citations are present. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical validation of an architectural recipe rather than on new theoretical axioms or invented physical entities. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Deeper networks can in principle capture more complex functions if training difficulties are overcome.
Implicit background assumption stated in the opening paragraph of the abstract.

pith-pipeline@v0.9.0 · 5648 in / 1112 out tokens · 30843 ms · 2026-05-25T10:10:15.422651+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.