{"total":23,"items":[{"citing_arxiv_id":"2605.12536","ref_index":151,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Information as Maximum-Caliber Deviation: A bridge between Integrated Information Theory and the Free Energy Principle","primary_cat":"q-bio.NC","submitted_at":"2026-05-03T07:22:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Information defined as maximum-caliber deviation derives IIT 3.0 cause-effect repertoires from constrained entropy maximization and equates to prediction error under CLT and LDT.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"measures how much the effective number of paths overX t →X t+1 shrinks whenX t is distributed asρrather than µ. 5.2.1 Introducing perplexity Perplexity was first introduce in language modeling to assess the complexity of speech recognition tasks[143]. It has since been widely adopted to assess the performance of language models, quantifying the effective branching factor of a model[151, 164]. Intuitively, it reflects the fact that a Kronecker-Delta distribution will place its mass on just one element, while a uniform distribution will traverse the entire state space. 
It is defined as the inverted geometric mean of a probability mass function [167]: Perp(ρ) = ∏_{x∈Ω} ρ(x)^{−ρ(x)}. (5.14) Equivalently, perplexity is an exponentiation of the entropy functional: Perp(ρ) = e^{H(ρ)}."},{"citing_arxiv_id":"2311.16867","ref_index":298,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Falcon Series of Open Language Models","primary_cat":"cs.CL","submitted_at":"2023-11-28T15:12:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.06161","ref_index":221,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StarCoder: may the source be with you!","primary_cat":"cs.CL","submitted_at":"2023-05-09T08:16:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.10487","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws and Interpretability of Learning from Repeated Data","primary_cat":"cs.LG","submitted_at":"2022-05-21T02:14:27+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction 
heads associated with generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.14198","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Flamingo: a Visual Language Model for Few-Shot Learning","primary_cat":"cs.CV","submitted_at":"2022-04-29T16:29:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[51] Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring uniﬁed video-language pre-training. arXiv:2203.07303, 2022. [52] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv:1602.02410, 2016. [53] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020. [54] Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 
The Hateful Memes Challenge: Detecting hate speech in"},{"citing_arxiv_id":"2201.11990","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model","primary_cat":"cs.CL","submitted_at":"2022-01-28T08:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2201.08239","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LaMDA: Language Models for Dialog Applications","primary_cat":"cs.CL","submitted_at":"2022-01-20T15:44:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637, 2020. [19] Tomas Mikolov, Martin Karaﬁát, Lukas Burget, Jan Cernock`y, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010. [20] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural networks. In ICML, 2011. [21] Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016. [22] Jeremy Howard and Sebastian Ruder. 
Universal language model ﬁne-tuning for text classiﬁcation. InACL, 2018. [23] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional"},{"citing_arxiv_id":"2112.04426","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Improving language models by retrieving from trillions of tokens","primary_cat":"cs.CL","submitted_at":"2021-12-08T17:32:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2005.14165","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Models are Few-Shot Learners","primary_cat":"cs.CL","submitted_at":"2020-05-28T17:29:03+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with the hope of stimulating further study of test-time behavior of language models. 3.9.1 Arithmetic To test GPT-3's ability to perform simple arithmetic operations without task-speciﬁc training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language: • 2 digit addition (2D+) - The model is asked to add two integers sampled uniformly from [0, 100), phrased in the form of a question, e.g. \"Q: What is 48 plus 76? 
A: 124.\" • 2 digit subtraction (2D-) - The model is asked to subtract two integers sampled uniformly from [0, 100); the answer may be negative. Example: \"Q: What is 34 minus 53? A: -19\". • 3 digit addition (3D+) - Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000). 21 Figure 3.10: Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a signiﬁcant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175), with the latter being able to reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a signiﬁcant fraction of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot are shown in the appendix. • 3 digit subtraction (3D-) - Same as 2 digit subtraction, except numbers are uniformly sampled from[0, 1000). • 4 digit addition (4D+) - Same as 3 digit addition, except uniformly sampled from [0, 10000). • 4 digit subtraction (4D-) - Same as 3 digit subtraction, except uniformly sampled from [0, 10000). • 5 digit addition (5D+) - Same as 3 digit addition, except uniformly sampled from [0, 100000). • 5 digit subtraction (5D-) - Same as 3 digit subtraction, except uniformly sampled from [0, 100000). • 2 digit multiplication (2Dx) - The model is asked to multiply two integers sampled uniformly from [0, 100), e.g. \"Q: What is 24 times 42? A: 1008\". 
• One-digit composite (1DC) - The model is asked to perform a"},{"citing_arxiv_id":"1911.05507","ref_index":80,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compressive Transformers for Long-Range Sequence Modelling","primary_cat":"cs.LG","submitted_at":"2019-11-13T14:36:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1911.02116","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unsupervised Cross-lingual Representation Learning at Scale","primary_cat":"cs.CL","submitted_at":"2019-11-05T22:42:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1910.10683","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","primary_cat":"cs.LG","submitted_at":"2019-10-23T17:37:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled 
Corpus.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.11158","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cross-Lingual Transfer for Distantly Supervised and Low-resources Indonesian NER","primary_cat":"cs.CL","submitted_at":"2019-07-25T16:04:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Cross-lingual fine-tuning of pre-trained LMs yields significant gains on small gold Indonesian NER and competitive results on large silver data versus monolingual LM or POS transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.09273","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Build an Assistant in Minecraft?","primary_cat":"cs.AI","submitted_at":"2019-07-22T12:32:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A rationale is presented for developing an assistant in Minecraft to advance natural language understanding and dialogue learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.01677","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scalable Multi Corpora Neural Language Models for ASR","primary_cat":"cs.CL","submitted_at":"2019-07-02T23:28:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The authors report scalable training of neural LMs from heterogeneous corpora for ASR second-pass rescoring, delivering 6.2% relative WER reduction with minimal latency 
increase.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.01470","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Augmenting Self-attention with Persistent Memory","primary_cat":"cs.LG","submitted_at":"2019-07-02T15:56:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1904.10509","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generating Long Sequences with Sparse Transformers","primary_cat":"cs.LG","submitted_at":"2019-04-23T19:29:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1712.00409","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deep Learning Scaling is Predictable, Empirically","primary_cat":"cs.LG","submitted_at":"2017-12-01T17:13:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data 
size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1710.03740","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mixed Precision Training","primary_cat":"cs.AI","submitted_at":"2017-10-10T17:42:04+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1706.03762","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Attention Is All You Need","primary_cat":"cs.CL","submitted_at":"2017-06-12T17:57:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pith review generated a malformed one-line summary.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"recurrent nets: the difﬁculty of learning long-term dependencies, 2001. [13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997. [14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832-841. ACL, August 2009. [15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016. [16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016. [17] Łukasz Kaiser and Ilya Sutskever. 
Neural GPUs learn algorithms."},{"citing_arxiv_id":"1701.06538","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","primary_cat":"cs.LG","submitted_at":"2017-01-23T18:10:00+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1609.03499","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WaveNet: A Generative Model for Raw Audio","primary_cat":"cs.SD","submitted_at":"2016-09-12T17:29:40+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":9.0,"formal_verification":"none","one_line_summary":"WaveNet generates realistic raw audio using an autoregressive neural network with dilated convolutions, achieving state-of-the-art naturalness in speech synthesis for English and Mandarin.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1605.08803","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Density estimation using Real NVP","primary_cat":"cs.LG","submitted_at":"2016-05-27T21:24:32+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"product of conditionals using the probability chain rule according 
to a fixed ordering over dimensions, simplifying log-likelihood evaluation and sampling. Recent work in this line of research has taken advantage of recent advances in recurrent networks [51], in particular long-short term memory [26], and residual networks [25, 24] in order to learn state-of-the-art generative image models [61, 46] and language models [32]. The ordering of the dimensions, although often arbitrary, can be critical to the training of the model [66]. The sequential nature of this model limits its computational efficiency. For example, its sampling procedure is sequential and non-parallelizable, which can become cumbersome in applications like speech and music synthesis, or real-time rendering."}],"limit":50,"offset":0}