An Intrinsic Nearest Neighbor Analysis of Neural Machine Translation Architectures

Christof Monz; Hamidreza Ghader

arxiv: 1907.03885 · v1 · pith:HUPTK6F6new · submitted 2019-07-08 · 💻 cs.CL · cs.LG· cs.NE

An Intrinsic Nearest Neighbor Analysis of Neural Machine Translation Architectures

Hamidreza Ghader , Christof Monz This is my paper

Pith reviewed 2026-05-25 00:49 UTC · model grok-4.3

classification 💻 cs.CL cs.LGcs.NE

keywords neural machine translationtransformerrecurrent modelsnearest neighborslexical semanticssyntaxencoder hidden statesintrinsic evaluation

0 comments

The pith

Transformer encoders capture lexical semantics better than recurrent ones, while recurrent forward and backward layers split semantic and contextual roles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines encoder hidden states in transformer and recurrent neural machine translation models through a nearest-neighbor lens rather than feeding states into separate classifiers. It measures how often nearest neighbors align with word embeddings or WordNet entries for lexical semantics and checks syntactic structure overlap to assess syntax capture. Results show transformers align more closely on lexical semantics than recurrent models, matching earlier extrinsic findings, yet do not outperform on syntax. Within recurrent models the backward layer captures more word semantics while the forward layer encodes more surrounding context. This supplies an intrinsic comparison that directly inspects what the states represent.

Core claim

Analysis of nearest neighbors of encoder hidden states reveals that transformers are superior in capturing lexical semantics compared with recurrent models, as their neighbors share more information with the underlying word embeddings and related WordNet entries, yet transformers are not necessarily better at capturing the underlying syntax. In recurrent models the backward recurrent layer learns more about the semantics of words whereas the forward recurrent layer encodes more context.

What carries the argument

Nearest-neighbor search over encoder hidden states, compared to word embeddings, WordNet relations, and syntactic structure similarities.

If this is right

Transformers exhibit stronger lexical-semantic alignment than recurrent models.
Recurrent models divide labor such that the backward layer prioritizes word semantics and the forward layer prioritizes context.
Syntactic structure similarities contribute comparably to neighbor grouping in both architectures.
Intrinsic nearest-neighbor evaluation yields conclusions consistent with prior extrinsic classifier-based studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The split between layers in recurrent models suggests targeted use of unidirectional states for tasks that need either semantics or context.
Similar nearest-neighbor checks could be applied to other encoder-decoder setups to compare information distribution.
Model selection for translation or related tasks may benefit from matching architecture strengths to whether semantics or syntax dominates the required output.

Load-bearing premise

Proximity of hidden states under a nearest-neighbor metric directly reflects shared lexical or syntactic information rather than training artifacts or distance biases.

What would settle it

If nearest neighbors of hidden states show no greater overlap with WordNet synonyms or dependency-parse matches than randomly chosen words from the same corpus, the claim that the metric reveals linguistic capture would not hold.

read the original abstract

Earlier approaches indirectly studied the information captured by the hidden states of recurrent and non-recurrent neural machine translation models by feeding them into different classifiers. In this paper, we look at the encoder hidden states of both transformer and recurrent machine translation models from the nearest neighbors perspective. We investigate to what extent the nearest neighbors share information with the underlying word embeddings as well as related WordNet entries. Additionally, we study the underlying syntactic structure of the nearest neighbors to shed light on the role of syntactic similarities in bringing the neighbors together. We compare transformer and recurrent models in a more intrinsic way in terms of capturing lexical semantics and syntactic structures, in contrast to extrinsic approaches used by previous works. In agreement with the extrinsic evaluations in the earlier works, our experimental results show that transformers are superior in capturing lexical semantics, but not necessarily better in capturing the underlying syntax. Additionally, we show that the backward recurrent layer in a recurrent model learns more about the semantics of words, whereas the forward recurrent layer encodes more context.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a nearest-neighbor probe on NMT encoder states that mostly aligns with prior classifier results but leaves the meaning of the overlaps open to data artifacts.

read the letter

The main observation is that transformer encoder states show higher overlap with word embeddings and WordNet entries than RNN states, while syntactic structure in the neighbors looks comparable across architectures. They also report that the backward RNN layer aligns more with semantics and the forward layer with broader context. This is presented as an intrinsic alternative to the classifier-based probes in earlier work. The layer-specific split in the RNN case is a concrete detail that stands out as new within this line of analysis. The paper does a straightforward job of running the same nearest-neighbor check on both model families and tying the results back to the extrinsic findings it cites. That keeps the contribution focused and easy to compare. The soft spots sit in the interpretation and the missing experimental controls. The abstract states the outcomes without numbers, neighbor counts, distance choices, or checks against frequency or positional biases. The claim that proximity under nearest neighbors directly signals captured lexical or syntactic information is load-bearing, yet the stress-test concern about parallel-corpus artifacts is not obviously addressed. If the overlaps largely track regularities already present in the training data, the architecture differences become harder to isolate. This work is mainly for readers already following representation analysis in MT. It supplies one more data point on the same question rather than a shift in how we evaluate models. The thinking is clear and engages the cited literature without obvious internal contradictions. I would bring it to a reading group as maybe, to talk through whether the method adds enough beyond the classifier results. I would not cite it in my own work. It deserves peer review so the full experimental details and any controls can be examined.

Referee Report

2 major / 2 minor

Summary. The paper proposes an intrinsic nearest-neighbor analysis of encoder hidden states in transformer and recurrent NMT models. It measures overlap between nearest neighbors (under cosine/Euclidean distance) and external resources (word embeddings, WordNet) to assess capture of lexical semantics, and examines syntactic properties of the neighbors to evaluate syntactic structure capture. The central empirical claims are that transformers are superior to RNNs at lexical semantics but not necessarily at syntax, and that in bidirectional RNNs the backward layer encodes more semantics while the forward layer encodes more context.

Significance. If the nearest-neighbor overlaps can be shown to reflect model-internal representations rather than parallel-corpus regularities or metric artifacts, the work would supply a useful intrinsic counterpart to prior extrinsic classifier probes of NMT representations and could help explain architecture-specific performance differences on semantic versus syntactic tasks.

major comments (2)

[Abstract and §4 (Experiments)] Abstract and experimental results: the load-bearing interpretation that higher NN overlap demonstrates superior capture of lexical semantics or syntax lacks any described control (e.g., random encoder, frequency-matched baseline, or shuffled-label comparison) that would show the observed overlaps exceed those expected from any sufficiently expressive model trained on the same parallel data. Without such controls the forward/backward RNN split and transformer-vs-RNN comparison inherit the same ambiguity.
[Abstract] Abstract: the reported results supply no quantitative details (dataset sizes, number of sentences/words analyzed, distance metric, hidden-state dimensionality, error bars, or statistical tests), making it impossible to assess whether the claimed superiority of transformers on semantics or the forward/backward distinction is robust.

minor comments (2)

[Abstract] The abstract states conclusions but omits all numerical results, which is atypical for an empirical paper and hinders immediate evaluation.
[Method section] Notation for the distance metric and neighbor selection procedure should be defined explicitly before the results are presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below and outline the revisions we plan to make.

read point-by-point responses

Referee: [Abstract and §4 (Experiments)] Abstract and experimental results: the load-bearing interpretation that higher NN overlap demonstrates superior capture of lexical semantics or syntax lacks any described control (e.g., random encoder, frequency-matched baseline, or shuffled-label comparison) that would show the observed overlaps exceed those expected from any sufficiently expressive model trained on the same parallel data. Without such controls the forward/backward RNN split and transformer-vs-RNN comparison inherit the same ambiguity.

Authors: We agree that the absolute levels of NN overlap could benefit from baselines to rule out corpus artifacts. Our main findings are relative comparisons between models trained on the same data, which mitigates some concerns. Nevertheless, to strengthen the claims, we will include a random encoder baseline and a frequency-matched word baseline in the revised experiments section. This will allow us to demonstrate that the observed overlaps are indeed higher than expected from non-semantic models. revision: yes
Referee: [Abstract] Abstract: the reported results supply no quantitative details (dataset sizes, number of sentences/words analyzed, distance metric, hidden-state dimensionality, error bars, or statistical tests), making it impossible to assess whether the claimed superiority of transformers on semantics or the forward/backward distinction is robust.

Authors: The abstract prioritizes brevity, but we recognize the importance of key details for assessing robustness. We will revise the abstract to include the dataset (WMT14 English-German with approximately 4.5M sentences), the number of words analyzed (top 10k frequent words), the distance metric (cosine similarity), hidden state dimensions (512 for RNN, 512 for transformer), and note that results include standard deviations over multiple runs with statistical significance tests reported in Section 4. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical nearest-neighbor comparisons against external resources

full rationale

The paper conducts an empirical study of encoder hidden states via nearest-neighbor overlap with WordNet entries and word embeddings, plus syntactic analysis. No derivation, equation, or prediction reduces by construction to a fitted parameter, self-citation chain, or renamed input. Claims rest on direct experimental measurements against independent external benchmarks (WordNet, embeddings), with no self-definitional steps or load-bearing uniqueness theorems. This matches the default case of a self-contained empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the domain assumption that nearest-neighbor proximity in hidden-state space is a valid proxy for captured information; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Nearest neighbors in hidden-state space reflect the lexical semantics and syntactic information captured by the model
Core premise of the intrinsic analysis method; invoked when comparing neighbors to embeddings and WordNet entries.

pith-pipeline@v0.9.0 · 5702 in / 1188 out tokens · 24931 ms · 2026-05-25T00:49:01.312105+00:00 · methodology

An Intrinsic Nearest Neighbor Analysis of Neural Machine Translation Architectures

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)