Multiplicative Models for Recurrent Language Modeling
Pith reviewed 2026-05-25 12:12 UTC · model grok-4.3
The pith
Shared parametrization of the second-order term improves performance in multiplicative recurrent language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing new multiplicative models that vary the degree of parameter sharing in the second-order term and measuring their performance on character-level language modeling, the authors establish that shared parametrization of the intermediate state is the relevant architectural feature for mitigating the correlation problem in recurrent sequence generation.
What carries the argument
The shared intermediate state in multiplicative recurrent updates: a single second-order interaction term, computed once per time step from the current input and the previous hidden state, and reused by every gate and by the candidate update.
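To make the sharing concrete, here is a minimal NumPy sketch of one mLSTM step. It assumes the commonly cited mLSTM formulation (Krause et al., 2016), omits biases, and is not taken from the paper under review. The names `init_params`, `mlstm_step`, and the `W_*` keys are hypothetical and chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hid, seed=0):
    """Hypothetical initializer: small random weights, biases omitted."""
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return 0.1 * rng.standard_normal((rows, cols))

    # One pair of matrices produces the shared intermediate state m.
    params = {"W_mx": w(n_hid, n_in), "W_mh": w(n_hid, n_hid)}
    # Each gate (i, f, o) and the candidate (c) reads the input x and m.
    for unit in ("i", "f", "o", "c"):
        params["W_" + unit + "x"] = w(n_hid, n_in)
        params["W_" + unit + "m"] = w(n_hid, n_hid)
    return params

def mlstm_step(params, x, h_prev, c_prev):
    """One step of an mLSTM-style cell (assumed formulation, see lead-in)."""
    # Second-order term: elementwise product of two linear maps.
    # It is computed once here and reused by all four units below.
    m = (params["W_mx"] @ x) * (params["W_mh"] @ h_prev)

    def pre(unit):
        return params["W_" + unit + "x"] @ x + params["W_" + unit + "m"] @ m

    i, f, o = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
    c = f * c_prev + i * np.tanh(pre("c"))
    h = o * np.tanh(c)
    return h, c, m
```

In this sketch the previous hidden state reaches the gates only through `m`, so the input-dependent product is the single channel by which the past influences the update. That is the feature the review refers to as shared.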
If this is right
- Recurrent generators become less likely to repeat earlier mistakes because the shared term reduces hidden-state correlation.
- Character-level language modeling perplexity improves when the second-order term is shared across the gates rather than parametrized separately for each of them.
- The mLSTM architecture can be simplified or generalized while preserving its advantage by retaining only the shared intermediate state.
- Multiplicative extensions remain useful only when the sharing mechanism is kept; removing the sharing removes the benefit.
Where Pith is reading between the lines
- The same sharing principle could be tested on word-level or subword language modeling to check whether the benefit scales beyond characters.
- If the shared term acts as a low-rank bottleneck, similar sharing might stabilize training in other recurrent or state-space models.
- A direct ablation that isolates only the sharing variable while freezing all other weights would give a cleaner causal test than the current model variants.
Load-bearing premise
Any measured performance differences between the tested models can be attributed specifically to whether the second-order term is shared rather than to other unstated differences in architecture or training procedure.
What would settle it
Training a multiplicative model without shared parameters but otherwise identical in every other architectural detail and hyperparameter setting, then finding that its character-level perplexity matches or beats the shared version on the same datasets.
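One practical obstacle to such a test is that removing the sharing also changes the parameter count. The sketch below counts weights for a shared cell and for a hypothetical unshared counterpart in which each of the four units (three gates plus the candidate) owns its own pair of intermediate-state matrices. Both layouts are assumptions made for illustration: biases are ignored, and the paper's actual variants may be defined differently.

```python
def block_size(n_in, n_hid):
    # One input-to-hidden matrix plus one hidden-to-hidden matrix.
    return n_hid * n_in + n_hid * n_hid

def shared_params(n_in, n_hid):
    # One block for the shared intermediate state, four blocks for the units.
    return 5 * block_size(n_in, n_hid)

def unshared_params(n_in, n_hid):
    # Hypothetical variant: each unit has its own intermediate-state block
    # in addition to its own unit block.
    return 8 * block_size(n_in, n_hid)

def match_hidden_size(n_in, n_hid_shared):
    """Largest hidden size at which the unshared variant fits within the
    shared model's weight budget (assumes at least size 1 fits)."""
    budget = shared_params(n_in, n_hid_shared)
    n_hid = 1
    while unshared_params(n_in, n_hid + 1) <= budget:
        n_hid += 1
    return n_hid
```

Under these assumptions the unshared variant carries 1.6 times as many weights at equal hidden size, so a comparison must either shrink its hidden size (as `match_hidden_size` does) or accept a capacity difference as a confound.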
Original abstract
Recently, there has been interest in multiplicative recurrent neural networks for language modeling. Indeed, simple Recurrent Neural Networks (RNNs) encounter difficulties recovering from past mistakes when generating sequences due to high correlation between hidden states. These challenges can be mitigated by integrating second-order terms in the hidden-state update. One such model, multiplicative Long Short-Term Memory (mLSTM), is particularly interesting in its original formulation because of the sharing of its second-order term, referred to as the intermediate state. We explore these architectural improvements by introducing new models and testing them on character-level language modeling tasks. This allows us to establish the relevance of shared parametrization in recurrent language modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces new multiplicative recurrent neural network variants that extend the mLSTM architecture by varying the parametrization of the second-order (intermediate) state term. It evaluates these models on character-level language modeling tasks and concludes that the results establish the relevance of shared parametrization for recurrent language modeling.
Significance. If the experiments properly isolate the effect of parameter sharing while holding all other architectural and training details fixed, the work would provide a useful clarification on why multiplicative interactions help RNNs recover from errors in sequence generation. The absence of any reported results, baselines, or controls in the provided abstract, however, leaves the practical impact undetermined.
major comments (2)
- [Abstract] Abstract: the claim that testing the new models 'allows us to establish the relevance of shared parametrization' is unsupported because the abstract supplies no experimental results, baselines, error analysis, or description of the tasks, making it impossible to evaluate whether any gains are observed or attributable to the claimed mechanism.
- [Abstract] Abstract / experimental design: the manuscript does not indicate that all other architectural choices (hidden-state update rules, number of parameters, optimization procedure) were held fixed while toggling only the sharing of the second-order term; without such controls, performance differences cannot be attributed specifically to shared parametrization rather than incidental model differences.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments on our manuscript. We address each major comment below and will make revisions to improve the clarity of the abstract.
- Referee: [Abstract] Abstract: the claim that testing the new models 'allows us to establish the relevance of shared parametrization' is unsupported because the abstract supplies no experimental results, baselines, error analysis, or description of the tasks, making it impossible to evaluate whether any gains are observed or attributable to the claimed mechanism.
  Authors: The abstract is intended as a high-level summary of the paper's contributions and conclusions. The full manuscript details the experimental setup on character-level language modeling tasks, including baselines and results that support the relevance of shared parametrization. We agree that the abstract could better hint at the evidence and will revise it to briefly note the key experimental findings and controls.
  Revision: yes
- Referee: [Abstract] Abstract / experimental design: the manuscript does not indicate that all other architectural choices (hidden-state update rules, number of parameters, optimization procedure) were held fixed while toggling only the sharing of the second-order term; without such controls, performance differences cannot be attributed specifically to shared parametrization rather than incidental model differences.
  Authors: The models introduced are direct variants of mLSTM differing in the parametrization of the intermediate state, with other architectural elements and training procedures kept consistent across comparisons. This design isolates the effect of shared parametrization. However, we acknowledge that the abstract does not explicitly state this, and we will add clarifying language to ensure the experimental controls are clear.
  Revision: yes
Circularity Check
No circularity: empirical comparison of new architectures does not reduce to self-definition or fitted inputs
The paper introduces new multiplicative RNN variants and evaluates them on character-level language modeling tasks to argue for the value of shared second-order parametrization. No equations, derivations, or self-citations are supplied in the available text that would make any reported performance difference equivalent to its inputs by construction. The central claim rests on experimental outcomes rather than a mathematical reduction or renamed ansatz; therefore the derivation chain is self-contained and receives the default non-circularity finding.