Pointer Sentinel Mixture Models
Pith reviewed 2026-05-11 04:07 UTC · model grok-4.3
The pith
A pointer sentinel mixture model lets neural language models copy a word from recent context or generate one from a softmax classifier.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The pointer sentinel mixture architecture enables a neural sequence model to either reproduce a word from the recent context using a pointer or produce a word from a standard softmax classifier. The pointer sentinel-LSTM variant sets a new state of the art on the Penn Treebank dataset with a perplexity of 70.9 while requiring substantially fewer parameters than a conventional softmax-based LSTM.
What carries the argument
The pointer sentinel mixture model, which uses a sentinel mechanism to choose between pointing to a word in the recent context and generating from the softmax classifier.
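To make the mechanism concrete, the following is a minimal sketch of how such a mixture could combine the two components, assuming the sentinel gate is the sentinel's share of a joint softmax over the context window; the function and variable names are illustrative, not the authors' code.

```python
# Minimal sketch of a pointer sentinel mixture step (illustrative, not the paper's code).
import numpy as np

def pointer_sentinel_mix(ptr_scores, sentinel_score, vocab_logits, window_ids, vocab_size):
    """Mix pointer and softmax components into one next-word distribution.

    ptr_scores     : (L,) unnormalized scores for each position in the context window
    sentinel_score : scalar score for falling back to the softmax classifier
    vocab_logits   : (V,) logits from the standard softmax classifier
    window_ids     : (L,) vocabulary ids of the words in the context window
    """
    # Joint softmax over window positions plus the sentinel; the sentinel's
    # probability mass g is the gate handed to the vocabulary softmax.
    joint = np.concatenate([ptr_scores, [sentinel_score]])
    joint = np.exp(joint - joint.max())
    joint /= joint.sum()
    ptr_probs, g = joint[:-1], joint[-1]

    p_vocab = np.exp(vocab_logits - vocab_logits.max())
    p_vocab /= p_vocab.sum()

    # Scatter pointer mass onto the vocabulary ids it points at, then mix.
    p_ptr = np.zeros(vocab_size)
    np.add.at(p_ptr, window_ids, ptr_probs)
    return g * p_vocab + p_ptr  # total mass: g + (1 - g) = 1

# Toy call with illustrative sizes.
rng = np.random.default_rng(0)
p = pointer_sentinel_mix(rng.normal(size=100), 0.0, rng.normal(size=10_000),
                         rng.integers(0, 10_000, size=100), vocab_size=10_000)
assert abs(p.sum() - 1.0) < 1e-9
```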
Load-bearing premise
The pointer mechanism can reliably select the correct word from recent context, without introducing pointing errors frequent enough to degrade overall perplexity.
What would settle it
Evidence that the model frequently points to the wrong recent word on a held-out test set, or that it fails to improve perplexity on a corpus with many rare words that appear unambiguously in the recent context.
Original abstract
Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the pointer sentinel mixture architecture for neural sequence models. This model can either copy a word from recent context via a pointer mechanism or generate one from a standard softmax classifier. The pointer sentinel-LSTM variant is reported to achieve state-of-the-art language modeling performance on the Penn Treebank benchmark (70.9 perplexity) while using substantially fewer parameters than a comparable softmax LSTM. The paper also introduces the WikiText corpus to support evaluation on longer contexts and larger vocabularies.
Significance. If the performance claims hold, the work demonstrates that a lightweight mixture of copying and generation can improve perplexity and parameter efficiency on standard language modeling benchmarks. The introduction of the freely available WikiText corpus provides a useful resource for the community to study realistic vocabularies and longer-range dependencies. The approach directly targets the difficulty softmax-based models have with rare words when context is unambiguous.
major comments (2)
- [§4 (Experimental Results), Table 1] The headline claim of 70.9 perplexity on PTB with fewer parameters than a standard softmax LSTM is load-bearing for the paper's contribution, yet the manuscript provides no ablation or error analysis quantifying pointer accuracy (e.g., fraction of tokens correctly copied by the pointer versus fallback to softmax). Without this, it is impossible to confirm that pointing errors are not being compensated by the softmax component, which would undermine the claimed efficiency advantage. (A sketch of such a diagnostic follows these comments.)
- [§3.2 (Pointer Sentinel Mixture)] The sentinel gate is presented as deciding between pointer and softmax, but the training objective and inference procedure for the mixture are not shown to guarantee that the pointer component surfaces the correct token at a rate sufficient to explain the reported perplexity reduction. A concrete test (e.g., oracle pointer accuracy on the test set) is needed to secure the central claim.
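As one way to carry out the ablation requested in the first comment, here is a hedged sketch (illustrative names and gate threshold, not the authors' evaluation code) of a per-token report of how often the mixture leans on the pointer and how often that pointer choice matches the target:

```python
# Illustrative diagnostic: pointer usage rate and pointer accuracy per token.
import numpy as np

def pointer_usage_report(gate, ptr_argmax, targets, threshold=0.5):
    """gate: (N,) sentinel mass given to the softmax; ptr_argmax, targets: (N,) token ids."""
    gate = np.asarray(gate, dtype=float)
    uses_pointer = gate < threshold                     # most mass went to the pointer
    correct_copy = np.asarray(ptr_argmax) == np.asarray(targets)
    return {
        "pointer_selected": float(uses_pointer.mean()),
        "pointer_accuracy_when_selected": float(correct_copy[uses_pointer].mean())
        if uses_pointer.any() else float("nan"),
    }

# Toy example; real use would log gates and pointer argmaxes over the PTB test stream.
print(pointer_usage_report(gate=[0.9, 0.2, 0.1], ptr_argmax=[5, 7, 7], targets=[5, 7, 2]))
```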
minor comments (2)
- [§4.3] The WikiText corpus is introduced but its construction details (tokenization, preprocessing, train/valid/test splits) are only sketched; a short appendix or subsection with exact statistics and download instructions would improve reproducibility.
- [§3.2] Notation for the sentinel vector and attention scores in Eq. (3)–(5) is introduced without an explicit statement of how the final output distribution is normalized when the pointer and softmax are mixed; a hedged sketch of one consistent normalization follows these comments.
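For reference, one normalization consistent with the abstract's description (a hedged reconstruction, not a verbatim restatement of the paper's Eq. (3)–(5)): if the sentinel gate $g$ is the sentinel's share of a joint softmax over the window positions and the sentinel, then

```latex
p(y \mid x) \;=\; g\, p_{\mathrm{vocab}}(y \mid x) \;+\; \sum_{i \in I(y,\,x)} a_i,
\qquad\text{with}\qquad g + \sum_{i=1}^{L} a_i = 1,
```

where $I(y,x)$ indexes the window positions holding word $y$ and $a_i$ are the pointer attention weights. Summing over the vocabulary gives $g + (1-g) = 1$, so under this reading no extra renormalization step is required.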
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. The comments highlight important aspects of validating the pointer mechanism's contribution, and we address each point below with plans for revision.
Point-by-point responses
-
Referee: [§4 (Experimental Results), Table 1] The headline claim of 70.9 perplexity on PTB with fewer parameters than a standard softmax LSTM is load-bearing for the paper's contribution, yet the manuscript provides no ablation or error analysis quantifying pointer accuracy (e.g., fraction of tokens correctly copied by the pointer versus fallback to softmax). Without this, it is impossible to confirm that pointing errors are not being compensated by the softmax component, which would undermine the claimed efficiency advantage.
Authors: We agree that quantifying the pointer's accuracy and usage frequency would strengthen the claims. The original experiments emphasize end-to-end perplexity and parameter count, but we will add an ablation in the revised manuscript reporting the fraction of tokens for which the pointer is selected at inference time, along with the pointer's per-token accuracy on the PTB test set. This will clarify the extent to which the mixture relies on copying versus softmax generation. revision: partial
-
Referee: [§3.2 (Pointer Sentinel Mixture)] The sentinel gate is presented as deciding between pointer and softmax, but the training objective and inference procedure for the mixture are not shown to guarantee that the pointer component surfaces the correct token at a rate sufficient to explain the reported perplexity reduction. A concrete test (e.g., oracle pointer accuracy on the test set) is needed to secure the central claim.
Authors: The sentinel is trained jointly via the mixture loss, which directly optimizes the decision between components. To provide the requested concrete validation, we will include an oracle analysis in the revision: we will report the accuracy of an ideal pointer that always copies the correct token when it appears in the context window, and compare it to the learned sentinel's selection rate. This will bound the contribution of the pointer mechanism to the observed perplexity improvement. revision: partial
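A hedged sketch of the oracle diagnostic proposed above, assuming a fixed context window of `window` tokens; the code is illustrative and not taken from the paper:

```python
# Oracle pointer coverage: upper bound on what a perfect pointer could copy.
def oracle_pointer_coverage(token_ids, window=100):
    """Fraction of tokens whose target occurs among the previous `window` tokens."""
    hits = 0
    for t in range(1, len(token_ids)):
        lo = max(0, t - window)
        if token_ids[t] in token_ids[lo:t]:
            hits += 1
    return hits / max(1, len(token_ids) - 1)

# Toy id sequence; real use would run over the PTB test stream.
print(oracle_pointer_coverage([3, 7, 3, 9, 7, 7, 2], window=3))  # -> 0.5
```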
Circularity Check
No significant circularity; results are empirical measurements on external benchmarks
Full rationale
The paper proposes a pointer-sentinel mixture architecture for language modeling, defines it via explicit equations for the pointer attention, sentinel gate, and mixture, then trains the model and reports perplexity on the fixed Penn Treebank benchmark. No step in the architecture or evaluation reduces by construction to its own inputs: the 70.9 perplexity figure is measured on held-out data rather than being a fitted parameter renamed as a prediction, and no uniqueness theorem, self-citation chain, or ansatz is invoked to force the result. The chain from model definition to reported result is checked against external evaluation rather than its own outputs, consistent with standard empirical modeling papers.
Forward citations
Cited by 60 Pith papers
-
HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
-
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
-
Learning the Signature of Memorization in Autoregressive Language Models
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models
Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data...
-
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
BCJR-QAT makes trellis quantization differentiable via BCJR soft decoding at finite temperature, allowing QAT to improve 2-bit LLM perplexity over PTQ with a fused GPU kernel and a drift-budget escape condition.
-
Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
-
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.
-
Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking
Residual connections align cross-layer gradients while symmetry-breaking activations prevent rotational drift, causing principal singular vectors of adjacent layers to align.
-
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
-
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
-
BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs
BWLA is the first post-training quantization method for LLMs that achieves 1-bit weights paired with low-bit activations such as 6 bits, using OKT to reshape weights and suppress activation tails plus PSP for low-rank...
-
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...
-
The Safety-Aware Denoiser for Text Diffusion Models
SAD modifies the denoising process in text diffusion models to enforce safety constraints at inference time, reducing unsafe generations while preserving quality and diversity.
-
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
-
Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
High-variance activation directions are uncorrelated with predictions, transformer blocks grow more linear with depth, and single-block linear replacement yields 34x compression on Mistral's final block at a 1.71 perp...
-
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
-
Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
-
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
-
From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
A BFP NPU microarchitecture using row/column blocking and per-path protections achieves near-DMR reliability at 3.55% geometric mean performance overhead and under 2% hardware cost.
-
SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs
SHIELD reduces eDRAM refresh energy by 35% for LLM inference on edge NPUs by isolating sign/exponent from mantissa bits, disabling refresh on transient QO mantissas, and relaxing it on persistent KV mantissas while ke...
-
Gradient Boosting within a Single Attention Layer
Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...
-
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
Semantic Smoothing for Language Models via Distribution Estimation and Embeddings
Semantic smoothing formulates next-word distribution estimation under KL loss with embedding-based KL-proximity side information, yielding an interpolation estimator with worst-case risk O(min{Δ, d/n}) that empiricall...
-
Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.
-
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improve...
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
-
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST analytically compiles labeled examples into fast weights via a single forward pass, matching backprop adaptation performance with over 90% less time and up to 95% less memory than memory-based methods.
-
DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
DiCLIP uses diffusion-based visual correlation enhancement and text semantic augmentation to improve CLIP-generated class activation maps for weakly supervised semantic segmentation, outperforming prior methods on PAS...
-
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
-
Context-Aware Wireless Token Communication via Joint Token Masking and Detection
A joint token masking and detection scheme with masked language models improves token reconstruction over noisy wireless channels by up to 1.77x on Europarl and 1.63x on WikiText-103 compared to conventional methods.
-
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip delivers a GPU-friendly lossless KV-cache compressor using an offline top-16 exponent codebook plus escape stream, achieving 613 GB/s compression and 2182 GB/s decompression throughput with up to 1.32x end-to...
-
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
-
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
FASQ delivers calibration-free LLM compression with continuous size trade-offs via product quantization and custom CUDA kernels that accelerate decode beyond FP16 speeds on consumer hardware.
-
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
-
CLion: Efficient Cautious Lion Optimizer with Enhanced Generalization
CLion achieves O(1/N) generalization error and O(√d / T^{1/4}) convergence for nonconvex stochastic optimization, improving on Lion's O(1/(N τ^T)) bound.
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
Quantization Dominates Rank Reduction for KV-Cache Compression
Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...
-
Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V
A position-agnostic nonlinear pre-projection MLP plus content skip connection in transformer attention improves LAMBADA accuracy by 40.6% and reduces perplexity by 39% on 160M-scale models.
-
LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
-
A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.
-
Rethinking Residual Errors in Compensation-based LLM Quantization
Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
-
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
-
Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
-
Emergent Semantic Role Understanding in Language Models
Semantic role understanding partially emerges during language model pre-training, with linear probes on frozen representations achieving substantial performance that improves with scale but does not match fine-tuned m...
-
mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.
-
Adaptive Memory Decay for Log-Linear Attention
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings ver...
-
Forge-UGC: FX optimization and register-graph engine for universal graph compiler
Forge-UGC delivers a hardware-agnostic four-phase compiler for transformers that reduces compilation time by 6.9-9.2x, inference latency by 18-36%, and energy use by 30-41% on NPU hardware compared with existing frameworks.
-
MUXQ: Mixed-to-Uniform Precision MatriX Quantization via Low-Rank Outlier Decomposition
MUXQ uses low-rank outlier decomposition to redistribute activation outliers, allowing mixed-to-uniform INT8 quantization of LLMs with lower perplexity than naive methods on GPT-2 models.
-
Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
-
Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder
A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.
Reference graph
Works this paper leans on
-
[1]
Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
Adi, Yossi, Kermany, Einat, Belinkov, Yonatan, Lavi, Ofer, and Goldberg, Yoav. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. arXiv preprint arXiv:1608.04207,
-
[2]
A Neural Knowledge Language Model
Ahn, Sungjin, Choi, Heeyoul, Pärnamaa, Tanel, and Bengio, Yoshua. A Neural Knowledge Language Model. CoRR, abs/1608.00318,
-
[3]
One billion word benchmark for measuring progress in statistical language modeling
Chelba, Ciprian, Mikolov, Tomas, Schuster, Mike, Ge, Qi, Brants, Thorsten, Koehn, Phillipp, and Robinson, Tony. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv preprint arXiv:1312.3005,
-
[4]
Long Short-Term Memory-Networks for Machine Reading
Cheng, Jianpeng, Dong, Li, and Lapata, Mirella. Long Short-Term Memory-Networks for Machine Reading. CoRR, abs/1601.06733,
-
[5]
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
Gal, Yarin. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv preprint arXiv:1512.05287,
- [6]
-
[7]
Pointing the Unknown Words
Gülçehre, Çağlar, Ahn, Sungjin, Nallapati, Ramesh, Zhou, Bowen, and Bengio, Yoshua. Pointing the Unknown Words. arXiv preprint arXiv:1603.08148,
- [8]
-
[9]
Character-aware neural language models
Kim, Yoon, Jernite, Yacine, Sontag, David, and Rush, Alexander M. Character-aware neural language models. CoRR, abs/1508.06615,
-
[10]
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
Krueger, David, Maharaj, Tegan, Kramár, János, Pezeshki, Mohammad, Ballas, Nicolas, Ke, Nan Rosemary, Goyal, Anirudh, Bengio, Yoshua, Larochelle, Hugo, Courville, Aaron, et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. arXiv preprint arXiv:1606.01305,
-
[11]
Latent Predictor Networks for Code Generation
Ling, Wang, Grefenstette, Edward, Hermann, Karl Moritz, Kociský, Tomáš, Senior, Andrew, Wang, Fumin, and Blunsom, Phil. Latent Predictor Networks for Code Generation. CoRR, abs/1603.06744,
-
[12]
How to Construct Deep Recurrent Neural Networks, April 2014
Pascanu, Razvan, Gülçehre, Çağlar, Cho, Kyunghyun, and Bengio, Yoshua. How to Construct Deep Recurrent Neural Networks. CoRR, abs/1312.6026, 2013a. Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In ICML, 2013b. Rosenfeld, Roni. A Maximum Entropy Approach to Adaptive Statistical Languag...
-
[13]
Recurrent neural network regularization
Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329,
-
[14]
Recurrent Highway Networks
Zilly, Julian Georg, Srivastava, Rupesh Kumar, Koutník, Jan, and Schmidhuber, Jürgen. Recurrent Highway Networks. arXiv preprint arXiv:1607.03474,