Decomposable Neural Paraphrase Generation

Lifeng Shang; Qun Liu; Xin Jiang; Zichao Li

arxiv: 1906.09741 · v1 · pith:QJ6MKDKBnew · submitted 2019-06-24 · 💻 cs.CL

Zichao Li, Xin Jiang, Lifeng Shang, Qun Liu

Pith reviewed 2026-05-25 17:55 UTC · model grok-4.3

keywords paraphrase generation · disentangled representations · neural models · domain adaptation · transformer · granularity levels · interpretability · controllability

The pith

A neural paraphrase model with separate encoders and decoders for lexical, phrasal and sentential levels produces more controllable and domain-adaptable outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DNPG, a Transformer-based architecture that decomposes paraphrase generation into independent components, each tied to a different granularity of change. This separation is intended to make the process of creating alternative phrasings more interpretable and easier to steer than with standard end-to-end models. The same structure supports an unsupervised adaptation procedure that transfers the model to new domains. A reader would care because many downstream tasks rely on paraphrases, and greater control plus cross-domain robustness would reduce the need for labeled data in each new setting.

Core claim

DNPG consists of multiple encoders and decoders with differing structures, each responsible for paraphrasing at one granularity level (lexical, phrasal or sentential). The model learns to generate paraphrases in a disentangled fashion so that modifications at one level do not bleed into others. Empirical results indicate that this decomposition improves interpretability and controllability of the generation process. An unsupervised domain-adaptation method built on the same decomposition yields competitive in-domain accuracy and markedly stronger performance when the target domain differs from the training distribution.

What carries the argument

Multiple encoders and decoders with different structures, each assigned to a distinct granularity level of paraphrasing.

If this is right

Paraphrase outputs can be controlled by intervening on individual granularity-specific components.
The generation process becomes more interpretable because each component's contribution can be examined separately.
In-domain paraphrase quality remains competitive with existing neural models.
Unsupervised adaptation to a new domain produces significantly better results than non-decomposed baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular structure may allow targeted retraining or editing of only the components that need adjustment when new paraphrase styles are required.
Similar decomposition could be tested on other text-generation tasks where control at multiple scales is useful.
If the disentanglement holds, the model could support fine-grained debugging by isolating failures to specific granularity levels.

Load-bearing premise

The separate encoders and decoders can be trained to keep their representations disentangled by granularity level without substantial leakage or interference between components.

What would settle it

An ablation or inspection experiment that finds the output of one granularity component strongly influences or correlates with outputs from the other components, or that shows no gain in domain-adaptation performance, would falsify the central claim.
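One way such an inspection could be set up is sketched below, under stated assumptions: each granularity component is reduced to one scalar score per example (a hypothetical choice, not something the paper specifies), and leakage is approximated by the Pearson correlation between those score series. The function name and the inputs are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score series.

    Assumption: xs and ys are hypothetical per-example scores taken from
    two different granularity components of the model.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A correlation near zero would be consistent with separation on this one measure, though it would not by itself rule out nonlinear dependence between components.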

Figures

Figures reproduced from arXiv: 1906.09741 by Lifeng Shang, Qun Liu, Xin Jiang, Zichao Li.

**Figure 3.** Aggregator.

… aggregator, which combines the outputs from the m-decoders. More precisely, the aggregator first decides the probability of the next word being at each granularity. The previous word $y_{t-1}$ and the context vectors $c_0$ and $c_1$, given by m-decoder$_0$ and m-decoder$_1$, are fed into an LSTM to make the prediction:

$$v_t = \mathrm{LSTM}([W_c[c_0; c_1; y_{t-1}]; v_{t-1}]), \qquad P(z_t \mid y_{1:t-1}, X) = \mathrm{GS}(W_v v_t, \tau), \tag{9}$$

where $v_t$ is the hidd…
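The granularity decision in Eq. (9) can be illustrated with a small sketch. It assumes that GS denotes the Gumbel-softmax relaxation with temperature τ, which the excerpt does not spell out, and it takes the logits (standing in for W_v v_t) as given rather than computing them with an LSTM. The function name and the example logits are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau, rng=random):
    """Draw one relaxed categorical sample from unnormalized logits.

    Assumption: this stands in for GS(W_v v_t, tau) in Eq. (9); `logits`
    plays the role of W_v v_t, with one entry per granularity level.
    """
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two granularity levels (z_t = 0 and z_t = 1).
sample = gumbel_softmax([1.0, -1.0], tau=0.5, rng=random.Random(0))
```

Lower temperatures push each sample toward a one-hot choice of granularity, while higher temperatures give smoother mixtures.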

**Figure 2.** Attention: phrase-level self-attention (upper)

**Figure 4.** Left: Training of language model in the source domain; Right: RL training of separator in the target domain.

… the shallow fusion (Gulcehre et al., 2015) and the multi-task learning (MTL) (Domhan and Hieber, 2017) that harness the non-parallel data in the target domain for adaptation. For fair comparisons, we use the Transformer+Copy as the base model for shallow fusion and implement a variant of MTL with …
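Shallow fusion, the first baseline named above, is usually described as adding a weighted language-model log-probability to the generation model's log-probability at each decoding step. The sketch below shows that scoring rule only. The function names, the toy distributions, and the weight `beta` are illustrative assumptions, not values from the paper.

```python
import math


def shallow_fusion_scores(model_logprobs, lm_logprobs, beta=0.3):
    """Fuse next-token scores: log p_model(tok) + beta * log p_lm(tok).

    Assumption: both arguments map candidate tokens to log-probabilities,
    and `beta` is a hypothetical interpolation weight.
    """
    shared = model_logprobs.keys() & lm_logprobs.keys()
    return {tok: model_logprobs[tok] + beta * lm_logprobs[tok] for tok in shared}


def pick_next_token(model_logprobs, lm_logprobs, beta=0.3):
    """Greedy choice of the token with the highest fused score."""
    scores = shallow_fusion_scores(model_logprobs, lm_logprobs, beta)
    return max(scores, key=scores.get)


# Toy distributions: the paraphrase model prefers "buy", while a
# target-domain language model prefers "purchase".
model = {"buy": math.log(0.5), "purchase": math.log(0.4), "get": math.log(0.1)}
target_lm = {"buy": math.log(0.1), "purchase": math.log(0.6), "get": math.log(0.3)}
choice = pick_next_token(model, target_lm, beta=1.0)
```

With `beta=0` the language model is ignored, so the weight controls how strongly target-domain text steers decoding.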

Original abstract

Paraphrasing exists at different granularity levels, such as lexical level, phrasal level and sentential level. This paper presents Decomposable Neural Paraphrase Generator (DNPG), a Transformer-based model that can learn and generate paraphrases of a sentence at different levels of granularity in a disentangled way. Specifically, the model is composed of multiple encoders and decoders with different structures, each of which corresponds to a specific granularity. The empirical study shows that the decomposition mechanism of DNPG makes paraphrase generation more interpretable and controllable. Based on DNPG, we further develop an unsupervised domain adaptation method for paraphrase generation. Experimental results show that the proposed model achieves competitive in-domain performance compared to the state-of-the-art neural models, and significantly better performance when adapting to a new domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DNPG splits paraphrase generation into granularity-specific encoder-decoder pairs but provides no mechanism to keep those components from interfering during joint training.

The paper's main contribution is an architecture that assigns dedicated encoder-decoder pairs to lexical, phrasal, and sentential paraphrase levels inside a Transformer backbone. This decomposition is presented as the route to better interpretability and controllability, plus an unsupervised domain-adaptation method built on top of it. The abstract reports competitive in-domain results against prior neural models and clearer gains when shifting domains. That combination of modularity and adaptation is the concrete advance worth noting.

The design itself is straightforward to describe and follows logically from the goal of separating granularity levels. If the full experiments include component-wise ablations or output examples that show the pairs behaving differently, the work would give practitioners a usable template for controllable generation.

The soft spot is the one flagged in the stress test. The model is trained jointly on standard paraphrase objectives with no auxiliary losses, orthogonality terms, or routing gates mentioned. Without something to discourage leakage across the granularity-specific modules, the representations could collapse or mix, which would undercut both the interpretability claim and the attribution of domain-adaptation gains to the decomposition. The abstract supplies no evidence that this was checked.

The paper is aimed at people working on neural paraphrase systems for data augmentation or domain shift. A reader who needs modular generation ideas could extract the architecture and test it themselves. It is coherent enough on its own terms to merit peer review so that the experiments can be examined for whether the components actually stay separate and whether the reported gains survive proper controls.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Decomposable Neural Paraphrase Generator (DNPG), a Transformer-based model with multiple encoders and decoders of differing structures, each tied to a granularity level (lexical, phrasal, sentential). It claims the decomposition yields disentangled representations that improve interpretability and controllability of paraphrase generation, while also enabling an unsupervised domain-adaptation method that achieves competitive in-domain results and significantly better out-of-domain performance than prior neural models.

Significance. If the disentanglement is realized and empirically verified, the architecture would supply a concrete mechanism for level-specific control in paraphrase generation and a practical route to domain adaptation without parallel data. The explicit multi-component design is a clear architectural contribution worth testing against standard sequence-to-sequence baselines.

major comments (2)

[Model architecture] Model architecture section: the description states that the encoders/decoders have different structures and are trained jointly on standard paraphrase objectives, yet supplies no auxiliary loss, orthogonality penalty, routing gate, or information-bottleneck term that would enforce separation of the granularity-specific representations. Without such a mechanism, joint training on identical sentence pairs leaves open collapse or leakage across components, directly threatening both the interpretability/controllability claim and the domain-adaptation gains attributed to decomposition.
[Experiments] Experimental section: the abstract asserts 'significantly better performance when adapting to a new domain' and 'competitive in-domain performance,' but the manuscript must report concrete metrics (BLEU, iBLEU, or human scores), the exact baselines, the adaptation datasets, and at least one ablation that isolates the contribution of the decomposed components versus a monolithic Transformer. Absent these, the central empirical claims cannot be evaluated.
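For readers unfamiliar with iBLEU, one of the metrics requested above, it is commonly defined as a weighted difference that rewards overlap with the reference paraphrase and penalizes copying from the source sentence. The sketch below assumes the two BLEU scores are already computed by some external tool. The default weight `alpha` is illustrative and is not taken from the manuscript.

```python
def ibleu(bleu_vs_reference, bleu_vs_source, alpha=0.9):
    """iBLEU = alpha * BLEU(output, reference) - (1 - alpha) * BLEU(output, source).

    Assumption: both BLEU inputs are precomputed scores on the same scale,
    and `alpha` is a hypothetical trade-off weight in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * bleu_vs_reference - (1.0 - alpha) * bleu_vs_source


# A candidate that overlaps heavily with its own source is penalized.
score = ibleu(bleu_vs_reference=0.4, bleu_vs_source=0.6, alpha=0.9)
```

Setting `alpha=1.0` recovers plain BLEU against the reference, which is why reporting both numbers helps separate fidelity from novelty.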

minor comments (2)

[Model architecture] Clarify the precise input/output interfaces between the granularity-specific modules and the shared components; the current description leaves the information flow ambiguous.
[Experiments] Add a figure or table that visualizes example outputs at each granularity level to substantiate the interpretability claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we respond point by point to the major comments.

Referee: [Model architecture] Model architecture section: the description states that the encoders/decoders have different structures and are trained jointly on standard paraphrase objectives, yet supplies no auxiliary loss, orthogonality penalty, routing gate, or information-bottleneck term that would enforce separation of the granularity-specific representations. Without such a mechanism, joint training on identical sentence pairs leaves open collapse or leakage across components, directly threatening both the interpretability/controllability claim and the domain-adaptation gains attributed to decomposition.

Authors: The architecture assigns encoders and decoders with explicitly different structures to each granularity level (lexical, phrasal, sentential) and trains them jointly. We argue that these structural differences, rather than an auxiliary loss, are the primary mechanism for encouraging separation. That said, we acknowledge the possibility of leakage under pure joint training and will add a dedicated paragraph in the model section discussing this design choice together with an ablation that measures cross-component information flow. revision: partial
Referee: [Experiments] Experimental section: the abstract asserts 'significantly better performance when adapting to a new domain' and 'competitive in-domain performance,' but the manuscript must report concrete metrics (BLEU, iBLEU, or human scores), the exact baselines, the adaptation datasets, and at least one ablation that isolates the contribution of the decomposed components versus a monolithic Transformer. Absent these, the central empirical claims cannot be evaluated.

Authors: Section 4 already lists the concrete BLEU/iBLEU scores, the full set of baselines (including standard Transformer seq2seq), the in-domain and out-of-domain datasets, and the unsupervised adaptation protocol. To directly isolate the decomposition, we will insert a new ablation table that replaces the multi-component DNPG with a single monolithic Transformer of comparable capacity while keeping all other training details fixed. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical architecture evaluated on standard objectives

The paper introduces DNPG as a Transformer variant with multiple encoders/decoders tied to lexical/phrasal/sentential granularity, claiming disentangled generation and improved domain adaptation. All load-bearing claims rest on experimental results rather than any derivation, equation, or self-referential definition. No fitted parameters are renamed as predictions, no uniqueness theorems are imported via self-citation, and no ansatz is smuggled in. The architecture is presented as a proposal whose benefits are measured directly against baselines; the absence of explicit disentanglement losses is a modeling choice, not a circular reduction of the claimed outcome to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the empirical effectiveness of the proposed decomposition and unsupervised adaptation; these are treated as validated by experiments whose details are absent from the abstract. No explicit free parameters, axioms, or invented entities beyond standard neural network training are named.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 4 internal anchors

[1]

On Using Monolingual Corpora in Neural Machine Translation

Joint copying and restricted generation for paraphrase. In Thirty-First AAAI Conference on Artificial Intelligence. Hanjun Dai, Bo Dai, Yan-Ming Zhang, Shuang Li, and Le Song. 2016. Recurrent hidden semi-markov model. In International Conference on Learning Representations. Tobias Domhan and Felix Hieber. 2017. Using target-side monolingual data for neu...

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Deep Recurrent Generative Decoder for Abstractive Text Summarization

Deep recurrent generative decoder for abstractive text summarization. arXiv preprint arXiv:1708.00625. Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Neural Paraphrase Generation with Stacked Residual LSTM Networks

Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3865–3878. Yi Liao, Lidong Bing, Piji Li, Shuming Shi, Wai Lam, and Tong Zhang. 2018. Quase: Sequence editing under quantifiable guidance. In Proceedings of the 2018 Conference on Empirical Methods in ...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

A task in a suit and a tie: paraphrase generation with semantic augmentation

Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Yu Su and Xifeng Yan. 2017. Cross-domain semantic parsing via paraphrasing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Langua...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[5]

In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3174–3187

Learning neural templates for text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3174–3187. A Algorithm for extracting templates Algorithm 1 ExtractSentParaPattern INPUT: X, Y, Zx, Zy, α′, V OUTPUT: ¯X, ¯Y 1: procedure EXTRACT ¯X 2: L ← |X|; 3: ¯X ← [ ]; 4: c ← 1; 5: p ← [ ]; 6: for l := 1 to L do...

work page 2018
[6]

The generated paraphrase does not make sense and is not human-generated text

Non-readable. The generated paraphrase does not make sense and is not human-generated text. Please note that readable is not equivalent to grammatical correct. That is, considered there are non-English speaker, a readable paraphrase can have grammar mistakes

work page
[7]

The answer to the paraphrased question is not helpful to the owner of the original question

Readable but is not accurate. The answer to the paraphrased question is not helpful to the owner of the original question. For instance, how can i study c++ → what be c++. Here are some examples of accurate paraphrase: (a) how can i learn c++ → what be the best way to learn c++ (b) can i learn c++ in a easy way → be learn c++ hard (c) do you have some sug...

work page
[8]

Just remove or add some stop words

Accurate but with trivial paraphrasing. Just remove or add some stop words. For instance, why can trump win the president election → why can trump win president election

work page
[9]

More or loss, there is information loss of a non-trivial paraphrase

Novel paraphrasing. More or loss, there is information loss of a non-trivial paraphrase. Thus, again, determine whether the paraphrase is equivalent to the original question from the perspective of question owner. Furthermore, it is not necessary for a non-trivial paraphrase contains rare paraphrasing pattern. For instance, maybe there is lot of par...

work page
[10]

A generated paraphrase with [UNK] should generally have higher rank

There maybe special token, that is, [UNK] in the generated paraphrase. A generated paraphrase with [UNK] should generally have higher rank

work page
[11]

Otherwise, please try your best to distinguish the quality of paraphrase

The same paraphrase should have same ranking. Otherwise, please try your best to distinguish the quality of paraphrase

work page
[12]

Please do Google search first when you see some strange word or phrase for better evaluation

work page
[13]

Just assume all the words are in their right form

Please note that all the words are stemmed and lower case. Just assume all the words are in their right form. For instance, what be you suggestion of some english movie is equivalent to What are your suggestions of some English movies

work page
