pith. machine review for the scientific record.

arxiv: 2303.17564 · v3 · submitted 2023-03-30 · 💻 cs.LG · cs.AI · cs.CL · q-fin.GN

Recognition: 2 theorem links · Lean Theorem

BloombergGPT: A Large Language Model for Finance

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 23:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · q-fin.GN
keywords large language models · financial NLP · domain-specific training · 50 billion parameters · mixed dataset · BloombergGPT · financial benchmarks

The pith

BloombergGPT, a 50 billion parameter model trained on financial plus general data, outperforms prior models on financial tasks while preserving general LLM performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BloombergGPT as a 50 billion parameter language model trained on a 363 billion token financial dataset drawn from Bloomberg sources, mixed with 345 billion tokens from general datasets. This mixed training is presented as the route to strong results on financial applications such as sentiment analysis, named entity recognition, and question answering. A sympathetic reader would care because the work shows a concrete way to build a domain-specialized LLM at scale without the usual drop in broad capabilities, and it supplies training details plus internal benchmarks that match intended use cases.

Core claim

BloombergGPT is a 50 billion parameter model trained on a combined corpus of 363 billion financial tokens and 345 billion general tokens; the resulting model exceeds existing models by substantial margins on financial benchmarks while matching performance on standard general-purpose LLM evaluations.

What carries the argument

The mixed financial-plus-general training corpus used to pretrain the 50 billion parameter transformer model.
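
As a concrete reading of what the mixed corpus implies, here is a minimal sketch of size-proportional sampling from the two pools. The corpus sizes come from the abstract; the sampling rule and names are assumptions for illustration, not the paper's documented procedure.

```python
import random

FIN_TOKENS = 363e9  # Bloomberg-sourced financial data (from the abstract)
GEN_TOKENS = 345e9  # general-purpose datasets (from the abstract)

# Assumed rule: draw each pretraining document in proportion to pool size.
P_FIN = FIN_TOKENS / (FIN_TOKENS + GEN_TOKENS)  # ~0.51

def sample_source(rng: random.Random) -> str:
    """Choose which pool the next pretraining document comes from."""
    return "financial" if rng.random() < P_FIN else "general"

rng = random.Random(0)
counts = {"financial": 0, "general": 0}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly a 51/49 financial-to-general split
```

At this ratio roughly half of the pretraining stream is financial text, which is one mechanism by which the recipe could lift financial-task performance without starving general capability.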

If this is right

  • Financial NLP tasks such as sentiment analysis and question answering become more accurate with the specialized model.
  • The same mixed-dataset recipe can be applied to build other domain-specific models without sacrificing general capability.
  • Releasing the training process details allows other groups to replicate or adapt the approach at similar scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern may extend to other high-stakes domains where both specialized knowledge and general reasoning matter.
  • Collecting hundreds of billions of domain tokens appears feasible for organizations with proprietary data pipelines.
  • Public release of training logs sets a precedent for transparency that could influence future large-model projects.

Load-bearing premise

The chosen financial data sources and internal benchmarks accurately represent real financial usage, and the observed gains arise from the training mix rather than from dataset artifacts or evaluation choices.

What would settle it

An independent evaluation on financial tasks drawn from sources outside both the training corpus and the reported benchmarks would show whether the performance advantage holds; one such design is sketched below.
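
A minimal sketch of one such design, assuming a temporal holdout: keep only evaluation items published after the model's training cutoff, so they cannot appear in the training corpus. The cutoff date and field names here are hypothetical.

```python
from datetime import date

TRAINING_CUTOFF = date(2022, 12, 31)  # assumed cutoff, for illustration

def held_out(items: list[dict]) -> list[dict]:
    """Keep eval items that postdate the cutoff and so cannot be in the training data."""
    return [it for it in items if it["published"] > TRAINING_CUTOFF]

eval_set = [
    {"id": "q1", "published": date(2022, 6, 1)},  # could overlap the corpus
    {"id": "q2", "published": date(2023, 5, 1)},  # provably outside it
]
print([it["id"] for it in held_out(eval_set)])  # ['q2']
```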

read the original abstract

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BloombergGPT, a 50 billion parameter language model trained on a mixed dataset of 363 billion financial tokens drawn from Bloomberg sources and 345 billion general-purpose tokens. It claims that this training regime produces a model that outperforms prior models on financial tasks by significant margins while preserving performance on standard general LLM benchmarks. Validation is reported across standard LLM benchmarks, open financial benchmarks, and a proprietary internal benchmark suite; the authors also document modeling choices, the training process, and evaluation methodology, and release Training Chronicles in Appendix C.

Significance. If the performance claims are substantiated, the work would constitute a notable contribution as the first reported large-scale domain-specific LLM for finance. The construction of what is described as one of the largest financial token datasets and the demonstration that mixed-domain training can improve financial-task performance without degrading general capabilities would be of direct interest to both the NLP and FinTech communities. The release of training chronicles adds practical value for reproducibility.

major comments (2)
  1. [Evaluation] Evaluation section (and abstract): The headline claim that mixed training yields 'significant margins' on financial tasks rests primarily on results from the authors' internal benchmark suite, which the text states 'most accurately reflect our intended usage.' No task definitions, question sources, scoring rubrics, contamination checks, or exclusion criteria are supplied for these benchmarks. Because the largest reported gains are tied to these undisclosed evaluations, independent verification of the central empirical result is impossible and the risk of selection bias or metric-specific artifacts cannot be assessed. (A sketch of one such contamination check follows this report.)
  2. [Evaluation] § on open financial benchmarks: While the paper references validation on open financial benchmarks, the text supplies no numerical tables, baseline comparisons, or error bars for these results either. The absence of concrete numbers leaves the 'outperforms existing models' assertion without direct quantitative support in the manuscript.
minor comments (2)
  1. [Abstract] Abstract: The abstract asserts benchmark outperformance but supplies no numerical results, error bars, baseline details, or exclusion criteria, leaving the central claim with limited direct support from the provided text.
  2. [Appendix C] Appendix C (Training Chronicles): Confirm that the released training log includes sufficient hyper-parameter schedules, hardware details, and any observed instabilities so that the training narrative can be followed by readers.
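
On the contamination checks referenced in major comment 1, a minimal sketch of a standard n-gram overlap screen. The 13-gram window follows common LLM-evaluation practice; it is an assumption here, not a procedure the paper specifies.

```python
from typing import Iterable, Set, Tuple

N = 13  # window size; a common convention in LLM evaluation, assumed here

def ngrams(text: str, n: int = N) -> Set[Tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_train_index(train_docs: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Index every n-gram that appears anywhere in the training corpus."""
    index: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        index |= ngrams(doc)
    return index

def is_contaminated(example: str, train_index: Set[Tuple[str, ...]]) -> bool:
    """Flag an eval example if any of its n-grams occurs in the training data."""
    return bool(ngrams(example) & train_index)
```

Reporting the fraction of internal-benchmark items flagged by a screen of this kind would address the verification concern without releasing the items themselves.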

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed review of our manuscript on BloombergGPT. The comments on the evaluation sections are well-taken, and we address each point below with clarifications and commitments to revisions where feasible while respecting necessary constraints on proprietary information.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section (and abstract): The headline claim that mixed training yields 'significant margins' on financial tasks rests primarily on results from the authors' internal benchmark suite, which the text states 'most accurately reflect our intended usage.' No task definitions, question sources, scoring rubrics, contamination checks, or exclusion criteria are supplied for these benchmarks. Because the largest reported gains are tied to these undisclosed evaluations, independent verification of the central empirical result is impossible and the risk of selection bias or metric-specific artifacts cannot be assessed.

    Authors: We appreciate the referee's emphasis on transparency for the internal benchmarks. These evaluations are constructed from Bloomberg's proprietary data and use cases to best reflect real-world financial applications, which is why full task definitions, question sources, and specific rubrics cannot be disclosed without violating confidentiality. We will revise the manuscript to provide expanded high-level descriptions of task categories (e.g., financial sentiment, report summarization, entity extraction), general scoring methodologies, and contamination mitigation steps that do not reveal sensitive details. This will better contextualize the results and address concerns about selection bias while preserving the proprietary nature of the suite. revision: partial

  2. Referee: [Evaluation] § on open financial benchmarks: While the paper references validation on open financial benchmarks, the text supplies no numerical tables, baseline comparisons, or error bars for these results either. The absence of concrete numbers leaves the 'outperforms existing models' assertion without direct quantitative support in the manuscript.

    Authors: We agree that the open financial benchmark results should be presented with explicit quantitative support in the main text. The evaluation section includes these comparisons, but to improve clarity and address the concern directly, we will add a dedicated summary table reporting numerical performance metrics on the open benchmarks (including baselines from prior models), along with error bars from multiple evaluation runs where applicable. This revision will provide the direct quantitative evidence requested. revision: yes
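
On the promised error bars: the manuscript does not specify a procedure, so here is a minimal sketch of one standard choice, a percentile bootstrap over per-example scores (distinct from, and complementary to, variance across repeated runs). Interface and numbers are illustrative.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Mean score with a percentile-bootstrap (1 - alpha) confidence interval."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.mean(scores), (lo, hi)

# Illustrative per-example correctness scores from one benchmark task.
acc, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
print(f"accuracy {acc:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```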

standing simulated objections (unresolved)
  • Full release of proprietary internal benchmark task definitions, question sources, and specific instances due to confidentiality and data protection requirements.

Circularity Check

0 steps flagged

No circularity: empirical training and benchmark evaluation

full rationale

The paper reports construction of a mixed financial+general token dataset, training of a 50B model, and empirical evaluation on standard LLM benchmarks, open financial benchmarks, and internal suites. No equations, derivations, or first-principles claims are present that reduce to self-defined quantities, fitted parameters renamed as predictions, or self-citation chains. Performance margins are reported outcomes of training and testing rather than tautological restatements of inputs. The analysis criteria for circularity (self-definitional, fitted-input-as-prediction, load-bearing self-citation, etc.) are not met.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on standard LLM training assumptions plus the unstated premise that the chosen financial data distribution and internal benchmarks are representative; no new entities are postulated.

free parameters (2)
  • Model parameter count (50 billion)
    Chosen scale for the model, likely balancing compute and performance.
  • Financial-to-general token ratio (363B:345B)
    Dataset mix proportions selected to achieve domain gains without general degradation.
axioms (1)
  • domain assumption: Standard transformer pretraining on next-token prediction transfers effectively to financial text when mixed with general data.
    Invoked to justify that mixed training preserves general capabilities; the objective is sketched below.
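
The objective this axiom leans on, sketched minimally in PyTorch for concreteness: standard next-token cross-entropy over shifted sequences. Shapes and names are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy for next-token prediction.

    logits: (batch, seq, vocab) model outputs; tokens: (batch, seq) token ids.
    """
    # Position t predicts token t+1: drop the final logit and the first target.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```

Nothing in this loss is finance-specific; the axiom is that minimizing it on the mixed corpus carries over to financial text without eroding general ability.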

pith-pipeline@v0.9.0 · 5516 in / 1234 out tokens · 45552 ms · 2026-05-13T23:14:46.985302+00:00 · methodology

discussion (0)


Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

    q-fin.CP 2026-04 conditional novelty 8.0

    Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

  2. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  3. AutoRedTrader: Autonomous Red Teaming of Trading Agents through Synthetic Misinformation Injection

    cs.CE 2026-05 unverdicted novelty 7.0

    AutoRedTrader generates synthetic financial misinformation via behavioral bias manipulation and agent feedback to red-team LLM trading agents, reaching 69% exposure and 26.67% attack success on Bitcoin data simulations.

  4. From Hypotheses to Factors: Constrained LLM Agents in Cryptocurrency Markets

    q-fin.PM 2026-04 unverdicted novelty 7.0

    Constrained LLM agents discover cryptocurrency factors that produce a portfolio with 44.55% annualized return and Sharpe ratio of 1.55 in pure out-of-sample 2024-2026 testing after trading costs.

  5. Detecting Corporate AI-Washing via Cross-Modal Semantic Inconsistency Learning

    cs.CY 2026-03 unverdicted novelty 7.0

    AWASH detects AI-washing via cross-modal inconsistency reasoning on a new trimodal benchmark of 88k corporate disclosure triplets, achieving F1 0.882 with a CMID network that grounds claims against patents and hiring data.

  6. Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.

  7. Agentic Retrieval-Augmented Generation for Financial Document Question Answering

    cs.AI 2026-05 unverdicted novelty 6.0

    FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...

  8. Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

    cs.CL 2026-05 unverdicted novelty 6.0

    Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

  9. RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    RouteNLP is a closed-loop LLM routing framework using conformal cascading and distillation co-optimization that cut inference costs by 58% in an 8-week enterprise deployment while preserving 91% acceptance and high qu...

  10. Cross-Stock Predictability via LLM-Augmented Semantic Networks

    q-fin.PM 2026-04 unverdicted novelty 6.0

    LLM filtering of embedding-based stock networks raises long-short Sharpe ratio from 0.742 to 0.820 and cuts max drawdown from -10.47% to -7.85% in 2011-2019 S&P 500 backtests.

  11. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

  12. MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model

    cs.CE 2026-04 unverdicted novelty 6.0

    MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.

  13. SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    SenseAI is a human-in-the-loop financial sentiment dataset with reasoning processes and market outcomes that reveals predictable LLM error patterns like Latent Reasoning Drift for RLHF alignment.

  14. SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

    cs.SE 2026-04 unverdicted novelty 6.0

    SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

  15. PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage

    cs.AI 2026-04 unverdicted novelty 6.0

    PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.

  16. CGCMA: Conditionally-Gated Cross-Modal Attention for Event-Conditioned Asynchronous Fusion

    cs.LG 2026-04 unverdicted novelty 6.0

    CGCMA separates text-conditioned grounding from lag-aware trust gating to fuse asynchronous price and web data, yielding the highest Sharpe ratio of +0.449 on a new crypto news corpus.

  17. Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents

    cs.CL 2026-03 unverdicted novelty 6.0

    LLM agents exhibit evaluation blindness in multi-turn financial advice, with stronger models showing up to 99.1% suitability violations when tool data is manipulated, as internal detection fails to produce safer outputs.

  18. Jailbreaking Black Box Large Language Models in Twenty Queries

    cs.LG 2023-10 conditional novelty 6.0

    PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.

  19. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    cs.LG 2023-10 accept novelty 6.0

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  20. Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics

    cs.LG 2026-05 unverdicted novelty 5.0

    SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.

  21. FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

    cs.AI 2026-04 unverdicted novelty 5.0

    FinGround reduces financial hallucinations by 68% over baselines in retrieval-equalized tests through atomic claim verification and grounding, with an 8B model retaining 91.4% F1 at low cost.

  22. When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM features optimized for high information coefficient with returns do not reliably improve PPO trading policies under distribution shifts, where price-only or macro baselines remain more robust.

  23. PRAGMA: Revolut Foundation Model

    cs.LG 2026-04 unverdicted novelty 5.0

    PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...

  24. CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization

    cs.CL 2026-04 unverdicted novelty 5.0

    CROP achieves 80.6% token reduction on GSM8K, LogiQA and BIG-Bench Hard with only nominal accuracy decline by regularizing automatic prompt optimization with response-length feedback.

  25. FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

    cs.CL 2026-04 unverdicted novelty 5.0

    FinReporting builds a canonical ontology for income, balance, and cash flow statements and uses constrained LLM agents as verifiers to produce localized, auditable reports from US, Japanese, and Chinese filings.

  26. AI Agents in Financial Markets: Architecture, Applications, and Systemic Implications

    q-fin.GN 2026-03 unverdicted novelty 5.0

    The paper develops a four-layer AI agent architecture and the Agentic Financial Market Model linking agent parameters such as autonomy and coupling to market efficiency, liquidity, and systemic risk, with an explorato...

  27. A Multi-Agent Orchestration Framework for Venture Capital Due Diligence

    cs.MA 2026-05 unverdicted novelty 4.0

    A multi-agent orchestration framework automates VC due diligence using LLMs, web retrieval, and a programmatic pipeline to extract and parse official Greek business registry filings while flagging data gaps.

  28. AgenticAITA: A Proof-Of-Concept About Deliberative Multi-Agent Reasoning for Autonomous Trading Systems

    q-fin.TR 2026-05 unverdicted novelty 4.0

    AgenticAITA proposes a training-free multi-agent LLM framework for autonomous trading using a deliberative pipeline, Z-score triggers, and safety gates, shown to run correctly in a five-day live dry-run with 157 invocations.

  29. ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection

    cs.CL 2026-04 unverdicted novelty 4.0

    ComplianceNLP integrates knowledge-graph-augmented RAG, multi-task legal text extraction, and gap analysis to detect regulatory compliance gaps, reporting 87.7 F1 and real-world efficiency gains over GPT-4o baselines.

  30. Developing an ESG-Oriented Large Language Model through ESG Practices

    cs.CE 2026-03 unverdicted novelty 3.0

    ESG-adapted versions of Qwen-3-4B using LoRA and IRM outperform the base model and Llama-3/Gemma-3 baselines on generative ESG question-answering tasks.

Reference graph

Works this paper leans on

140 extracted references · 140 canonical work pages · cited by 30 Pith papers · 30 internal anchors
