Hierarchical Document Encoder for Parallel Corpus Mining
Pith reviewed 2026-05-25 20:13 UTC · model grok-4.3
The pith
Hierarchical document encoder HiDE outperforms averaging and bag-of-words baselines for parallel document mining, reaching 94.9% P@1 on en-fr and 97.3% P@1 on en-es UN data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using document embeddings trained with HiDE achieves state-of-the-art performance on United Nations (UN) parallel document mining, 94.9% P@1 for en-fr and 97.3% P@1 for en-es.
Load-bearing premise
The reported gains on noisy data come from the hierarchical document-level training rather than from unstated differences in model capacity or data filtering. The abstract provides no training details or ablation controls that would confirm this.
Original abstract
We explore using multilingual document embeddings for nearest neighbor mining of parallel data. Three document-level representations are investigated: (i) document embeddings generated by simply averaging multilingual sentence embeddings; (ii) a neural bag-of-words (BoW) document encoding model; (iii) a hierarchical multilingual document encoder (HiDE) that builds on our sentence-level model. The results show document embeddings derived from sentence-level averaging are surprisingly effective for clean datasets, but suggest models trained hierarchically at the document-level are more effective on noisy data. Analysis experiments demonstrate our hierarchical models are very robust to variations in the underlying sentence embedding quality. Using document embeddings trained with HiDE achieves state-of-the-art performance on United Nations (UN) parallel document mining, 94.9% P@1 for en-fr and 97.3% P@1 for en-es.
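To make the first representation and the evaluation metric concrete, here is a minimal sketch of sentence-embedding averaging followed by nearest-neighbor retrieval. It is not the authors' implementation. It assumes cosine similarity as the nearest-neighbor measure and assumes P@1 means the fraction of source documents whose top-ranked target is the true translation; the abstract specifies neither. The function names and the three-dimensional toy vectors are hypothetical stand-ins for real multilingual sentence embeddings.

```python
import math


def average_embedding(sentence_vectors):
    """Collapse a document's sentence vectors into one vector by taking the mean."""
    dim = len(sentence_vectors[0])
    n = len(sentence_vectors)
    return [sum(vec[i] for vec in sentence_vectors) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity (assumed here; the paper may use a different score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_1(source_docs, target_docs):
    """Fraction of source documents whose nearest target shares their index.

    Assumes source_docs[i] and target_docs[i] are a true translation pair.
    """
    hits = 0
    for i, src in enumerate(source_docs):
        best = max(range(len(target_docs)),
                   key=lambda j: cosine(src, target_docs[j]))
        hits += int(best == i)
    return hits / len(source_docs)


# Hypothetical toy corpus: three documents per language, two sentences each.
en_docs = [
    [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
    [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],
]
fr_docs = [
    [[0.9, 0.1, 0.0], [0.9, 0.0, 0.1]],
    [[0.1, 0.9, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],
]

en_vecs = [average_embedding(doc) for doc in en_docs]
fr_vecs = [average_embedding(doc) for doc in fr_docs]
p_at_1 = precision_at_1(en_vecs, fr_vecs)
print(f"P@1 = {p_at_1:.3f}")
```

The averaging step has no trainable document-level parameters, which is why it serves as a baseline for the trained BoW and HiDE encoders described in the abstract.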
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores multilingual document embeddings for nearest-neighbor mining of parallel corpora. It compares three document representations—simple averaging of multilingual sentence embeddings, a neural bag-of-words model, and a hierarchical document encoder (HiDE) built on a sentence-level model—and reports that averaging works well on clean data while HiDE is more effective and robust on noisy data. HiDE is claimed to reach state-of-the-art P@1 of 94.9% (en-fr) and 97.3% (en-es) on United Nations parallel document mining.
Significance. If the experimental comparisons hold after proper controls for capacity and training details, the work would provide evidence that document-level hierarchical training improves robustness for parallel corpus mining on noisy web-scale data, with direct utility for machine translation data collection.
major comments (2)
- [Abstract] The central claim that HiDE's hierarchical training produces the reported SOTA gains and robustness on noisy UN data cannot be evaluated, because the abstract supplies no training procedure, model sizes, baseline hyperparameters, dataset statistics, or ablation that isolates the hierarchy while holding other factors fixed.
- [Abstract] The attribution of gains to the hierarchical architecture, rather than to unstated differences in capacity or data filtering, is load-bearing for the main conclusion, yet no capacity-matched baselines or ablations removing the document-level hierarchy are described.
minor comments (1)
- [Abstract] No error bars, multiple runs, or statistical significance tests are reported for the P@1 figures.
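
One way such error bars could be produced is a percentile bootstrap over per-document retrieval outcomes. The sketch below is illustrative only: the hit list is hypothetical and is not data from the paper, and `bootstrap_ci` is a name introduced here, not a function from any library.

```python
import random


def bootstrap_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample documents with replacement and recompute P@1 each time.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes: 1 if the top-ranked target was the true translation.
hits = [1] * 95 + [0] * 5
point = sum(hits) / len(hits)
low, high = bootstrap_ci(hits)
print(f"P@1 = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Reporting an interval of this kind next to each P@1 figure would show whether the gaps between HiDE and the baselines exceed resampling noise.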