{"work":{"id":"fef3833e-dc80-42a3-a1e0-ffbfafee6fff","openalex_id":null,"doi":null,"arxiv_id":"1609.07843","raw_key":null,"title":"Pointer Sentinel Mixture Models","authors":null,"authors_text":"Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher","year":2016,"venue":"cs.CL","abstract":"Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.","external_url":"https://arxiv.org/abs/1609.07843","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T13:15:51.665747+00:00","pith_arxiv_id":"1609.07843","created_at":"2026-05-09T06:20:42.324056+00:00","updated_at":"2026-05-25T13:15:51.665747+00:00","title_quality_ok":true,"display_title":"Pointer Sentinel Mixture Models","render_title":"Pointer Sentinel Mixture Models"},"hub":{"state":{"work_id":"fef3833e-dc80-42a3-a1e0-ffbfafee6fff","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":100,"external_cited_by_count":null,"distinct_field_count":12,"first_pith_cited_at":"2019-06-30T09:18:31+00:00","last_pith_cited_at":"2026-05-22T17:59:38+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-31T01:21:40.928524+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":9},{"context_role":"dataset","n":5},{"context_role":"method","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":9},{"context_polarity":"use_dataset","n":5},{"context_polarity":"unclear","n":1},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Pointer Sentinel Mixture Models","claims":[{"claim_text":"Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Tree","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"as the primary safety judge, withQwen3Guard-Gen-8B[ 77] and human verification to resolve ambiguous cases and false positives. General Utility.We evaluate the general utility of the target mod- els before and after appending the adversarial suffix via CoLA (lin- guistic acceptability) [66], RTE (inference) [11], WinoGrande (com- monsense reasoning) [50], OpenBookQA (general knowledge) [36], and ARC-Challenge (grade-school science) [10]. Mechanistic Validation of Routing Hijacking.To assess wheth","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"IE Event Argument Extraction WikiEvent [139] [13], [27], [37], [42] RAMS [140] [36], [37] Relation Extraction T-REx [141],ZsRE [142] [27], [51] Reasoning Commonsense Reasoning HellaSwag [143] [20], [66] CoT Reasoning CoT Reasoning [144] [27] Complex Reasoning CSQA [145] [55] Others Language Understanding MMLU [146] [7], [27], [28], [42], [43], [47], [72] Language Modeling WikiText-103 [147] [5], [29], [64], [71] StrategyQA [148] [14], [24], [48], [51], [55], [58] Fact Checking/Verification FEVER","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Single-layer wrap with QATDenseDecoderLayer (BCJR mode), full-vocabulary forward-KL between the FP16 teacher and the partially-quantized student. Hyper- parameters and schedule choices are reported per-experiment in Section 4.5 (Table 2); the headline result usesη=2×10 −4,N=10 steps, and the skip-high-T schedule diagnosed in Section 4.6. Evaluation.Perplexity on WikiText-2 [ 19] and C4 [ 23] following the lm-eval-harness proto- col [9]. For OLMoE we additionally report downstream zero-shot ( num","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Both models are trained on FineWeb-Edu [ 26] (sample- 10BT) for ∼10B tokens (50,000 steps, sequence length 2,048, effective batch size 96) using DeepSpeed ZeRO Stage 1 with BF16 on a single NVIDIA B200. Evaluation.We report utilize the LM Evaluation Harness WikiText- 103 perplexity [22], standard benchmarks via lm-evaluation-harness (HellaSwag [37], PIQA [3], ARC [4], WinoGrande [27], LAMBADA [25]), needle-in-a-haystack retrieval at context lengths up to 4,096, and Table 2: Length generalization","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[42] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397-2430. PMLR, 2023. [43] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016. [44] L","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"dataset that thoroughly exercises the model's computational graph, we eliminate dataset-induced variance as a confounding factor while keeping the computational overhead of exhaustive fault profiling tractable. B. Data Formats To evaluate the impact of hardware faults across modern mixed-precision regimes, we run all fault-injection campaigns under three training formats: IEEE FP16, BF16 [23], and FP8 [24], which are the de facto low-precision standards. Rather than varying datasets or model siz","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Pointer Sentinel Mixture Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (9 contexts).","role_counts":[{"n":9,"context_role":"background"},{"n":5,"context_role":"dataset"},{"n":1,"context_role":"method"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-25T13:25:59.650488+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"27b67732-b2b0-496e-89af-c6f31d59da68","orcid":null,"display_name":"Stephen Merity"},{"id":"8e55947d-1829-4d9d-9b98-a9f8f0f33c12","orcid":null,"display_name":"Caiming Xiong"},{"id":"2a715075-cb56-422f-ade9-1fef4e745b91","orcid":null,"display_name":"James Bradbury"},{"id":"461cf849-4cd1-4bad-bb0b-e772f0204eeb","orcid":null,"display_name":"Richard Socher"}]},"error":null,"updated_at":"2026-05-25T13:26:00.363781+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T12:40:51.436217+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":13},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":12},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers","work_id":"19ed8c44-202a-48f6-8169-637d5a5f2408","shared_citers":7},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":7},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":6},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":6},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"HellaSwag: Can a Machine Really Finish Your Sentence?","work_id":"79f44c0c-96f4-4edb-bc50-a3c9d6b85936","shared_citers":5},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":5},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":5},{"title":"OPT: Open Pre-trained Transformer Language Models","work_id":"d7ff3b21-1fff-4cf4-952a-4714e3ef2307","shared_citers":5},{"title":"9 Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher","work_id":"489855f8-bd1c-4c87-a334-f6ab27d6707d","shared_citers":4},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":4},{"title":"AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration","work_id":"ea9d1d72-db24-4cae-8c89-4ecd83dd87c1","shared_citers":4},{"title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions","work_id":"511eeb84-4b95-46d5-b14f-50da43f4f19f","shared_citers":4},{"title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","work_id":"756f9764-ecd6-4672-8043-b37c698c7ad2","shared_citers":4},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":4},{"title":"KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache","work_id":"735737c3-24e5-41c3-ab4f-04edcb36731c","shared_citers":4},{"title":"LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale","work_id":"98201f98-f4e5-4d1c-9ed7-b795e3c8f76c","shared_citers":4},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":4},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":4}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2020},{"n":3,"year":2022},{"n":1,"year":2024},{"n":1,"year":2025},{"n":48,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T12:50:34.554533+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T12:40:39.737609+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Pointer Sentinel Mixture Models","claims":[{"claim_text":"Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Tree","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"as the primary safety judge, withQwen3Guard-Gen-8B[ 77] and human verification to resolve ambiguous cases and false positives. General Utility.We evaluate the general utility of the target mod- els before and after appending the adversarial suffix via CoLA (lin- guistic acceptability) [66], RTE (inference) [11], WinoGrande (com- monsense reasoning) [50], OpenBookQA (general knowledge) [36], and ARC-Challenge (grade-school science) [10]. Mechanistic Validation of Routing Hijacking.To assess wheth","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"IE Event Argument Extraction WikiEvent [139] [13], [27], [37], [42] RAMS [140] [36], [37] Relation Extraction T-REx [141],ZsRE [142] [27], [51] Reasoning Commonsense Reasoning HellaSwag [143] [20], [66] CoT Reasoning CoT Reasoning [144] [27] Complex Reasoning CSQA [145] [55] Others Language Understanding MMLU [146] [7], [27], [28], [42], [43], [47], [72] Language Modeling WikiText-103 [147] [5], [29], [64], [71] StrategyQA [148] [14], [24], [48], [51], [55], [58] Fact Checking/Verification FEVER","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Single-layer wrap with QATDenseDecoderLayer (BCJR mode), full-vocabulary forward-KL between the FP16 teacher and the partially-quantized student. Hyper- parameters and schedule choices are reported per-experiment in Section 4.5 (Table 2); the headline result usesη=2×10 −4,N=10 steps, and the skip-high-T schedule diagnosed in Section 4.6. Evaluation.Perplexity on WikiText-2 [ 19] and C4 [ 23] following the lm-eval-harness proto- col [9]. For OLMoE we additionally report downstream zero-shot ( num","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Both models are trained on FineWeb-Edu [ 26] (sample- 10BT) for ∼10B tokens (50,000 steps, sequence length 2,048, effective batch size 96) using DeepSpeed ZeRO Stage 1 with BF16 on a single NVIDIA B200. Evaluation.We report utilize the LM Evaluation Harness WikiText- 103 perplexity [22], standard benchmarks via lm-evaluation-harness (HellaSwag [37], PIQA [3], ARC [4], WinoGrande [27], LAMBADA [25]), needle-in-a-haystack retrieval at context lengths up to 4,096, and Table 2: Length generalization","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[42] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397-2430. PMLR, 2023. [43] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models.arXiv preprint arXiv:1609.07843, 2016. [44] L","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"dataset that thoroughly exercises the model's computational graph, we eliminate dataset-induced variance as a confounding factor while keeping the computational overhead of exhaustive fault profiling tractable. B. Data Formats To evaluate the impact of hardware faults across modern mixed-precision regimes, we run all fault-injection campaigns under three training formats: IEEE FP16, BF16 [23], and FP8 [24], which are the de facto low-precision standards. Rather than varying datasets or model siz","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Pointer Sentinel Mixture Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (9 contexts).","role_counts":[{"n":9,"context_role":"background"},{"n":5,"context_role":"dataset"},{"n":1,"context_role":"method"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-25T13:26:00.368331+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Pointer Sentinel Mixture Models","claims":[{"claim_text":"Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Tree","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Pointer Sentinel Mixture Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T12:50:34.495087+00:00"}},"summary":{"title":"Pointer Sentinel Mixture Models","claims":[{"claim_text":"Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Tree","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Pointer Sentinel Mixture Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":13},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":12},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers","work_id":"19ed8c44-202a-48f6-8169-637d5a5f2408","shared_citers":7},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":7},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":6},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":6},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"HellaSwag: Can a Machine Really Finish Your Sentence?","work_id":"79f44c0c-96f4-4edb-bc50-a3c9d6b85936","shared_citers":5},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":5},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":5},{"title":"OPT: Open Pre-trained Transformer Language Models","work_id":"d7ff3b21-1fff-4cf4-952a-4714e3ef2307","shared_citers":5},{"title":"9 Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher","work_id":"489855f8-bd1c-4c87-a334-f6ab27d6707d","shared_citers":4},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":4},{"title":"AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration","work_id":"ea9d1d72-db24-4cae-8c89-4ecd83dd87c1","shared_citers":4},{"title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions","work_id":"511eeb84-4b95-46d5-b14f-50da43f4f19f","shared_citers":4},{"title":"DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter","work_id":"756f9764-ecd6-4672-8043-b37c698c7ad2","shared_citers":4},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":4},{"title":"KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache","work_id":"735737c3-24e5-41c3-ab4f-04edcb36731c","shared_citers":4},{"title":"LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale","work_id":"98201f98-f4e5-4d1c-9ed7-b795e3c8f76c","shared_citers":4},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":4},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":4}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2020},{"n":3,"year":2022},{"n":1,"year":2024},{"n":1,"year":2025},{"n":48,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"8e55947d-1829-4d9d-9b98-a9f8f0f33c12","orcid":null,"display_name":"Caiming Xiong","source":"manual","import_confidence":0.72},{"id":"2a715075-cb56-422f-ade9-1fef4e745b91","orcid":null,"display_name":"James Bradbury","source":"manual","import_confidence":0.72},{"id":"461cf849-4cd1-4bad-bb0b-e772f0204eeb","orcid":null,"display_name":"Richard Socher","source":"manual","import_confidence":0.72},{"id":"27b67732-b2b0-496e-89af-c6f31d59da68","orcid":null,"display_name":"Stephen Merity","source":"manual","import_confidence":0.72}]}}