arXiv: 2303.17564 · v3 · submitted 2023-03-30 · cs.LG · cs.AI · cs.CL · q-fin.GN

BloombergGPT: A Large Language Model for Finance

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann

Pith reviewed 2026-05-13 23:14 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.CL, q-fin.GN

keywords: large language models, financial NLP, domain-specific training, 50 billion parameters, mixed dataset, BloombergGPT, financial benchmarks

The pith

BloombergGPT, a 50 billion parameter model trained on financial plus general data, outperforms prior models on financial tasks while preserving general LLM performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BloombergGPT as a 50 billion parameter language model trained on a 363 billion token financial dataset drawn from Bloomberg sources, mixed with 345 billion tokens from general datasets. This mixed training is presented as the route to strong results on financial applications such as sentiment analysis, named entity recognition, and question answering. A sympathetic reader would care because the work shows a concrete way to build a domain-specialized LLM at scale without the usual drop in broad capabilities, and it supplies training details plus internal benchmarks that match intended use cases.

Core claim

BloombergGPT is a 50 billion parameter model trained on a combined corpus of 363 billion financial tokens and 345 billion general tokens; the resulting model exceeds existing models by substantial margins on financial benchmarks while matching performance on standard general-purpose LLM evaluations.

What carries the argument

The mixed financial-plus-general training corpus used to pretrain the 50 billion parameter transformer model.

If this is right

Financial NLP tasks such as sentiment analysis and question answering become more accurate with the specialized model.
The same mixed-dataset recipe can be applied to build other domain-specific models without sacrificing general capability.
Releasing the training process details allows other groups to replicate or adapt the approach at similar scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The pattern may extend to other high-stakes domains where both specialized knowledge and general reasoning matter.
Collecting hundreds of billions of domain tokens appears feasible for organizations with proprietary data pipelines.
Public release of the Training Chronicles sets a precedent for transparency that could influence future large-model projects.

Load-bearing premise

The chosen financial data sources and internal benchmarks accurately represent real financial usage, and the observed gains arise from the training mix rather than from dataset artifacts or evaluation choices.

What would settle it

An independent evaluation on financial tasks drawn from sources outside the training corpus and the reported benchmarks would show whether the performance advantage holds.

Original abstract

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BloombergGPT is the first reported 50B financial LLM trained on a 363B-token domain corpus mixed with general data, and the mixed approach looks workable, but the biggest claimed gains sit on internal benchmarks that outsiders cannot audit.

The paper gives a clear picture of how the authors assembled the financial dataset from Bloomberg sources and combined it with 345B general tokens. They lay out the modeling decisions, training setup, and practical lessons in the Training Chronicles appendix, which is the part most likely to be useful to other groups trying similar work at scale. That documentation is a concrete contribution even if the model itself stays proprietary.

The evaluation is the weaker part. The headline result, that the model beats existing ones on financial tasks without losing ground on general benchmarks, depends on a combination of open benchmarks and a proprietary internal suite. The paper notes that the internal tasks best match real usage, but it supplies no task definitions, scoring details, or contamination checks. Without those, it is hard to tell how much of the reported margin is robust and how much is tied to choices that cannot be reproduced. The abstract itself contains no numbers, so the size of the improvement is hard to judge from the summary alone.

This paper is mainly for people working on domain-adapted LLMs or financial NLP systems who want a concrete scaling example. It is worth sending to peer review because the dataset scale and mixed-training outcome belong in the record, provided the authors add more transparent numbers on the open benchmarks and clarify the internal evaluation setup.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BloombergGPT, a 50 billion parameter language model trained on a mixed dataset of 363 billion financial tokens drawn from Bloomberg sources and 345 billion general-purpose tokens. It claims that this training regime produces a model that outperforms prior models on financial tasks by significant margins while preserving performance on standard general LLM benchmarks. Validation is reported across standard LLM benchmarks, open financial benchmarks, and a proprietary internal benchmark suite; the authors also document modeling choices, the training process, and evaluation methodology, and release Training Chronicles in Appendix C.

Significance. If the performance claims are substantiated, the work would constitute a notable contribution as the first reported large-scale domain-specific LLM for finance. The construction of what is described as one of the largest financial token datasets and the demonstration that mixed-domain training can improve financial-task performance without degrading general capabilities would be of direct interest to both the NLP and FinTech communities. The release of training chronicles adds practical value for reproducibility.

major comments (2)

[Evaluation] Evaluation section (and abstract): The headline claim that mixed training yields 'significant margins' on financial tasks rests primarily on results from the authors' internal benchmark suite, which the text states 'most accurately reflect our intended usage.' No task definitions, question sources, scoring rubrics, contamination checks, or exclusion criteria are supplied for these benchmarks. Because the largest reported gains are tied to these undisclosed evaluations, independent verification of the central empirical result is impossible and the risk of selection bias or metric-specific artifacts cannot be assessed.
[Evaluation] § on open financial benchmarks: While the paper references validation on open financial benchmarks, the text supplies no numerical tables, baseline comparisons, or error bars for these results either. The absence of concrete numbers leaves the 'outperforms existing models' assertion without direct quantitative support in the manuscript.
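
The contamination checks requested in the first major comment can be made concrete with a minimal sketch. The text reviewed here does not say how, or whether, overlap between benchmarks and training data was measured; the word-level n-gram comparison below is one common approach, used purely for illustration. The function names `ngrams` and `overlap_fraction`, the whitespace tokenization, and the toy strings are all assumptions, not the authors' method.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, training_docs, n=8):
    """Fraction of the item's n-grams that also appear in any training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)


# Hypothetical toy data; a real check would stream the full training corpus.
training_docs = ["shares of the company rose sharply after the earnings call"]
leaked_item = "the company rose sharply after the earnings call"
clean_item = "bond yields fell as investors weighed new inflation data"

print(overlap_fraction(leaked_item, training_docs, n=3))
print(overlap_fraction(clean_item, training_docs, n=3))
```

A benchmark item with a high overlap fraction would be a candidate for exclusion or separate reporting. The threshold and the value of `n` are design choices that a real audit would need to justify.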

minor comments (2)

[Abstract] Abstract: The abstract asserts benchmark outperformance but supplies no numerical results, error bars, baseline details, or exclusion criteria, leaving the central claim with limited direct support from the provided text.
[Appendix C] Appendix C (Training Chronicles): Please confirm that the released Training Chronicles include sufficient detail on hyper-parameter schedules, hardware, and any observed instabilities so that readers can follow the training narrative.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed review of our manuscript on BloombergGPT. The comments on the evaluation sections are well-taken, and we address each point below with clarifications and commitments to revisions where feasible while respecting necessary constraints on proprietary information.

Point-by-point responses

Referee: [Evaluation] Evaluation section (and abstract): The headline claim that mixed training yields 'significant margins' on financial tasks rests primarily on results from the authors' internal benchmark suite, which the text states 'most accurately reflect our intended usage.' No task definitions, question sources, scoring rubrics, contamination checks, or exclusion criteria are supplied for these benchmarks. Because the largest reported gains are tied to these undisclosed evaluations, independent verification of the central empirical result is impossible and the risk of selection bias or metric-specific artifacts cannot be assessed.

Authors: We appreciate the referee's emphasis on transparency for the internal benchmarks. These evaluations are constructed from Bloomberg's proprietary data and use cases to best reflect real-world financial applications, which is why full task definitions, question sources, and specific rubrics cannot be disclosed without violating confidentiality. We will revise the manuscript to provide expanded high-level descriptions of task categories (e.g., financial sentiment, report summarization, entity extraction), general scoring methodologies, and contamination mitigation steps that do not reveal sensitive details. This will better contextualize the results and address concerns about selection bias while preserving the proprietary nature of the suite.

Revision: partial

Referee: [Evaluation] § on open financial benchmarks: While the paper references validation on open financial benchmarks, the text supplies no numerical tables, baseline comparisons, or error bars for these results either. The absence of concrete numbers leaves the 'outperforms existing models' assertion without direct quantitative support in the manuscript.

Authors: We agree that the open financial benchmark results should be presented with explicit quantitative support in the main text. The evaluation section includes these comparisons, but to improve clarity and address the concern directly, we will add a dedicated summary table reporting numerical performance metrics on the open benchmarks (including baselines from prior models), along with error bars from multiple evaluation runs where applicable. This revision will provide the direct quantitative evidence requested.

Revision: yes
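
As an illustration of what "error bars from multiple evaluation runs" could mean in practice, the sketch below computes a mean and a standard error of the mean over repeated runs. It is only one plausible convention: the rebuttal does not specify the statistic, and the helper name `mean_and_stderr` and the scores are hypothetical values invented for the example, not results from the paper.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) for per-run benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


# Hypothetical scores from three evaluation runs of one model on one task.
runs = [0.70, 0.72, 0.74]
mean, stderr = mean_and_stderr(runs)
print(f"score = {mean:.3f} +/- {stderr:.3f} (standard error, n={len(runs)})")
```

Reporting the number of runs alongside the interval matters here, since a standard error from three runs is itself a rough estimate.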

Standing simulated objections (not resolved)

Full release of the proprietary internal benchmark task definitions, question sources, and specific instances remains unresolved because of confidentiality and data protection requirements.

Circularity Check

0 steps flagged

No circularity: empirical training and benchmark evaluation

full rationale

The paper reports construction of a mixed financial+general token dataset, training of a 50B model, and empirical evaluation on standard LLM benchmarks, open financial benchmarks, and internal suites. No equations, derivations, or first-principles claims are present that reduce to self-defined quantities, fitted parameters renamed as predictions, or self-citation chains. Performance margins are reported outcomes of training and testing rather than tautological restatements of inputs. The analysis criteria for circularity (self-definitional, fitted-input-as-prediction, load-bearing self-citation, etc.) are not met.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim depends on standard LLM training assumptions plus the unstated premise that the chosen financial data distribution and internal benchmarks are representative; no new entities are postulated.

free parameters (2)

Model parameter count (50 billion)
Chosen scale for the model, likely balancing compute and performance.
Financial-to-general token ratio (363B:345B)
Dataset mix proportions selected to achieve domain gains without general degradation.
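
The mix ratio in the ledger entry above can be worked out directly from the token counts stated in the abstract. The short sketch below does that arithmetic; the helper name `corpus_shares` is an assumption introduced for illustration. Note that it describes the composition of the assembled corpus only. The text reviewed here does not state how many of these tokens the model actually saw during training.

```python
# Token counts as stated in the abstract, in billions.
FINANCIAL_TOKENS_B = 363
GENERAL_TOKENS_B = 345


def corpus_shares(financial_b, general_b):
    """Return the total corpus size and each source's share of it."""
    total = financial_b + general_b
    return {
        "total_b": total,
        "financial_share": financial_b / total,
        "general_share": general_b / total,
    }


shares = corpus_shares(FINANCIAL_TOKENS_B, GENERAL_TOKENS_B)
print(f"total corpus: {shares['total_b']}B tokens")        # 708B
print(f"financial share: {shares['financial_share']:.1%}")  # 51.3%
print(f"general share: {shares['general_share']:.1%}")      # 48.7%
```

The split is close to even, which is consistent with the review's framing of the corpus as a mixed dataset rather than a primarily financial one.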

axioms (1)

[domain assumption] Standard transformer pretraining on next-token prediction transfers effectively to financial text when mixed with general data.
Invoked to justify that mixed training preserves general capabilities.

pith-pipeline@v0.9.0 · 5516 in / 1234 out tokens · 45552 ms · 2026-05-13T23:14:46.985302+00:00 · methodology

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
q-fin.CP 2026-04 conditional novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
MeMo: Memory as a Model
cs.CL 2026-05 unverdicted novelty 7.0

MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
AutoRedTrader: Autonomous Red Teaming of Trading Agents through Synthetic Misinformation Injection
cs.CE 2026-05 unverdicted novelty 7.0

AutoRedTrader generates synthetic financial misinformation via behavioral bias manipulation and agent feedback to red-team LLM trading agents, reaching 69% exposure and 26.67% attack success on Bitcoin data simulations.
From Hypotheses to Factors: Constrained LLM Agents in Cryptocurrency Markets
q-fin.PM 2026-04 unverdicted novelty 7.0

Constrained LLM agents discover cryptocurrency factors that produce a portfolio with 44.55% annualized return and Sharpe ratio of 1.55 in pure out-of-sample 2024-2026 testing after trading costs.
Detecting Corporate AI-Washing via Cross-Modal Semantic Inconsistency Learning
cs.CY 2026-03 unverdicted novelty 7.0

AWASH detects AI-washing via cross-modal inconsistency reasoning on a new trimodal benchmark of 88k corporate disclosure triplets, achieving F1 0.882 with a CMID network that grounds claims against patents and hiring data.
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
cs.LG 2026-05 unverdicted novelty 6.0

Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
Agentic Retrieval-Augmented Generation for Financial Document Question Answering
cs.AI 2026-05 unverdicted novelty 6.0

FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...
Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls
cs.CL 2026-05 unverdicted novelty 6.0

Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
cs.CL 2026-04 unverdicted novelty 6.0

RouteNLP is a closed-loop LLM routing framework using conformal cascading and distillation co-optimization that cut inference costs by 58% in an 8-week enterprise deployment while preserving 91% acceptance and high qu...
Cross-Stock Predictability via LLM-Augmented Semantic Networks
q-fin.PM 2026-04 unverdicted novelty 6.0

LLM filtering of embedding-based stock networks raises long-short Sharpe ratio from 0.742 to 0.820 and cuts max drawdown from -10.47% to -7.85% in 2011-2019 S&P 500 backtests.
QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
cs.MA 2026-04 unverdicted novelty 6.0

QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model
cs.CE 2026-04 unverdicted novelty 6.0

MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.
SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

SenseAI is a human-in-the-loop financial sentiment dataset with reasoning processes and market outcomes that reveals predictable LLM error patterns like Latent Reasoning Drift for RLHF alignment.
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
cs.SE 2026-04 unverdicted novelty 6.0

SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage
cs.AI 2026-04 unverdicted novelty 6.0

PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.
CGCMA: Conditionally-Gated Cross-Modal Attention for Event-Conditioned Asynchronous Fusion
cs.LG 2026-04 unverdicted novelty 6.0

CGCMA separates text-conditioned grounding from lag-aware trust gating to fuse asynchronous price and web data, yielding the highest Sharpe ratio of +0.449 on a new crypto news corpus.
Jailbreaking Black Box Large Language Models in Twenty Queries
cs.LG 2023-10 conditional novelty 6.0

PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
cs.LG 2023-10 accept novelty 6.0

SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics
cs.LG 2026-05 unverdicted novelty 5.0

SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.
FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification
cs.AI 2026-04 unverdicted novelty 5.0

FinGround reduces financial hallucinations by 68% over baselines in retrieval-equalized tests through atomic claim verification and grounding, with an 8B model retaining 91.4% F1 at low cost.
When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies
cs.CL 2026-04 unverdicted novelty 5.0

LLM features optimized for high information coefficient with returns do not reliably improve PPO trading policies under distribution shifts, where price-only or macro baselines remain more robust.
PRAGMA: Revolut Foundation Model
cs.LG 2026-04 unverdicted novelty 5.0

PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...
CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization
cs.CL 2026-04 unverdicted novelty 5.0

CROP achieves 80.6% token reduction on GSM8K, LogiQA and BIG-Bench Hard with only nominal accuracy decline by regularizing automatic prompt optimization with response-length feedback.
FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures
cs.CL 2026-04 unverdicted novelty 5.0

FinReporting builds a canonical ontology for income, balance, and cash flow statements and uses constrained LLM agents as verifiers to produce localized, auditable reports from US, Japanese, and Chinese filings.
A Multi-Agent Orchestration Framework for Venture Capital Due Diligence
cs.MA 2026-05 unverdicted novelty 4.0

A multi-agent orchestration framework automates VC due diligence using LLMs, web retrieval, and a programmatic pipeline to extract and parse official Greek business registry filings while flagging data gaps.
AgenticAITA: A Proof-Of-Concept About Deliberative Multi-Agent Reasoning for Autonomous Trading Systems
q-fin.TR 2026-05 unverdicted novelty 4.0

AgenticAITA proposes a training-free multi-agent LLM framework for autonomous trading using a deliberative pipeline, Z-score triggers, and safety gates, shown to run correctly in a five-day live dry-run with 157 invocations.
ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection
cs.CL 2026-04 unverdicted novelty 4.0

ComplianceNLP integrates knowledge-graph-augmented RAG, multi-task legal text extraction, and gap analysis to detect regulatory compliance gaps, reporting 87.7 F1 and real-world efficiency gains over GPT-4o baselines.
Developing an ESG-Oriented Large Language Model through ESG Practices
cs.CE 2026-03 unverdicted novelty 3.0

ESG-adapted versions of Qwen-3-4B using LoRA and IRM outperform the base model and Llama-3/Gemma-3 baselines on generative ESG question-answering tasks.

Reference graph

Works this paper leans on

140 extracted references · 140 canonical work pages · cited by 28 Pith papers · 29 internal anchors

[1]

Artificial-Analysis

Dogu Araci. Finbert: Financial sentiment analysis with pre-trained language models. arXiV preprint arXiV:1908.10063, 2019

work page arXiv 1908
[2]

PLATO - XL : Exploring the large-scale pre-training of dialogue generation

Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhihua Wu, Zhen Guo, Hua Lu, Xinxian Huang, Xin Tian, Xinchao Xu, Yingzhan Lin, and Zheng-Yu Niu. PLATO - XL : Exploring the large-scale pre-training of dialogue generation. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 107--118, Online only, November 2...

work page 2022
[3]

S ci BERT : A pretrained language model for scientific text

Iz Beltagy, Kyle Lo, and Arman Cohan. S ci BERT : A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615--3620, Hong Kong, China, November 2019. Association for Computation...

work page doi:10.18653/v1/d19-1371 2019
[4]

On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610--623, 2021

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610--623, 2021

work page 2021
[5]

The fifth PASCAL recognizing textual entailment challenge

Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the Second Text Analysis Conference, TAC 2009, Gaithersburg, Maryland, USA, November 16-17, 2009 . NIST , 2009. URL https://tac.nist.gov/publications/2009/additional.papers/RTE5\_overview.proce...

work page 2009
[6]

The values encoded in machine learning research

Abeba Birhane, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, and Michelle Bao. The values encoded in machine learning research. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 173--184, 2022

work page 2022
[7]

PIQA: reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in...

work page 2020
[8]

org/10.5281/zenodo.5297715

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow , March 2021. URL https://doi.org/10.5281/zenodo.5297715. If you use this software, please cite it using these metadata

work page doi:10.5281/zenodo.5297715 2021
[9]

GPT - N eo X -20 B : An open-source autoregressive language model

Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT - N eo X -20 B : An open-source autoregressive language model. In Proceedings of BigScience E...

work page doi:10.18653/v1/2022.bigscience-1.9 2022
[10]

BioMedLM

Elliot Bolton, David Hall, Michihiro Yasunaga, Tony Lee, Chris Manning, and Percy Liang. BioMedLM . https://github.com/stanford-crfm/BioMedLM, 2023

work page 2023
[11]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

Byte pair encoding is suboptimal for language model pretraining

Kaj Bostrom and Greg Durrett. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617--4624, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.414. URL https://aclanthology.org/2020.findings-emnlp.414

work page doi:10.18653/v1/2020.findings-emnlp.414 2020
[13]

Popat, Peng Xu, Franz J

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning ( EMNLP - C o NLL ) , pages 858--867, Prague, Czech Republic, June 2007. Association for Computat...

work page 2007
[14]

Class-based n-gram models of natural language

Peter F Brown, Vincent J Della Pietra, Peter V Desouza, Jennifer C Lai, and Robert L Mercer. Class-based n-gram models of natural language. Computational linguistics, 18 0 (4): 0 467--480, 1992

work page 1992
[15]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert - Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

work page 2020
[16]

Brown, Dawn Xiaodong Song, \'U lfar Erlingsson, Alina Oprea, and Colin Raffel

Nicholas Carlini, Florian Tram \`e r, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Xiaodong Song, \'U lfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models. In USENIX Security Symposium, 2020

work page 2020
[17]

Quantifying Memorization Across Neural Language Models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models, 2022. URL https://arxiv.org/abs/2202.07646

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winte...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiV preprint arXiV:1604.06174, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[20]

F in QA : A dataset of numerical reasoning over financial data

Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. F in QA : A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697--3711, Online and Punta Can...

work page doi:10.18653/v1/2021.emnlp-main.300 2021
[21]

C onv F in QA : Exploring the chain of numerical reasoning in conversational finance question answering

Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang. C onv F in QA : Exploring the chain of numerical reasoning in conversational finance question answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6279--6292, Abu Dhabi, United Arab Emirates, December 2022. Assoc...

work page 2022
[22]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, Ja...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. B ool Q : Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Pape...

work page doi:10.18653/v1/n19-1300 2019
[24]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv, abs/1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

The pascal recognising textual entailment challenge

Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, 2007

work page 2007
[26]

The commitmentbank: Investigating projection in naturally occurring discourse

Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. The commitmentbank: Investigating projection in naturally occurring discourse. In Proceedings of Sinn und Bedeutung, pages 107--124, 2019

work page 2019
[27]

Bernice: A multilingual pre-trained encoder for Twitter

Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Philip Resnik, and Mark Dredze. Bernice: A multilingual pre-trained encoder for Twitter. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6191--6205, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL ht...

work page 2022
[28]

8-bit optimizers via block-wise quantization

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. In International Conference on Learning Representations, 2022

work page 2022
[29]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171--4186, Minneap...

work page doi:10.18653/v1/n19-1423 2019
[30]

Documenting large webtext corpora: A case study on the colossal clean crawled corpus

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286--1305, Online and Punta Cana, Dominica...

work page doi:10.18653/v1/2021.emnlp-main.98 2021
[31]

How Twitter is changing the nature of financial news discovery

Mark Dredze, Prabhanjan Kambadur, Gary Kazantsev, Gideon Mann, and Miles Osborne. How Twitter is changing the nature of financial news discovery. In Proceedings of the Second International Workshop on Data Science for Macro-Modeling, pages 1--5, 2016

work page 2016
[32]

Natural language processing in accounting, auditing and finance: A synthesis of the literature with a roadmap for future research

Ingrid E Fisher, Margaret R Garnsey, and Mark E Hughes. Natural language processing in accounting, auditing and finance: A synthesis of the literature with a roadmap for future research. Intelligent Systems in Accounting, Finance and Management, 23(3):157--214, 2016

work page 2016
[33]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2021. URL https://arxiv.org/abs/2101.00027

work page internal anchor Pith review Pith/arXiv arXiv 2021
[34]

Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text

Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text, 2022. URL https://arxiv.org/abs/2202.06935

work page arXiv 2022
[35]

The third PASCAL recognizing textual entailment challenge

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1--9, Prague, June 2007. Association for Computational Linguistics. URL https://aclanthology.org/W07-1401

work page 2007
[36]

Improving alignment of dialogue agents via targeted human judgements

Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nich...

work page internal anchor Pith review arXiv 2022
[37]

Semeval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning

Andrew S. Gordon, Zornitsa Kozareva, and Melissa Roemmele. Semeval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In International Workshop on Semantic Evaluation, 2011

work page 2012
[38]

News summarization and evaluation in the era of GPT-3

Tanya Goyal, Junyi Jessy Li, and Greg Durrett. News summarization and evaluation in the era of gpt-3, 2022. URL https://arxiv.org/abs/2209.12356

work page arXiv 2022
[39]

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342--8360, Online, July 2020. Association for Computational Linguistics. doi:...

work page doi:10.18653/v1/2020.acl-main.740 2020
[40]

The second pascal recognising textual entailment challenge

R Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, volume 7, 2006

work page 2006
[41]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[42]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

work page 2021
[43]

Query-key normalization for transformers

Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246--4253, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.379. URL https://aclanthology.org/2020.find...

work page doi:10.18653/v1/2020.findings-emnlp.379 2020
[44]

Scaling laws for transfer

Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021

work page arXiv 2021
[45]

An empirical analysis of compute-optimal large language model training

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack William Rae, and Laur...

work page 2022
[46]

Universal Language Model Fine-tuning for Text Classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328--339, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1031. URL https://aclanthology.o...

work page doi:10.18653/v1/p18-1031 2018
[47]

ClinicalBERT: Modeling clinical notes and predicting hospital readmission

Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv, April 2019. URL http://arxiv.org/abs/1904.05342

work page arXiv 2019
[48]

Continuous speech recognition by statistical methods

Frederick Jelinek. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64(4):532--556, 1976

work page 1976
[49]

Data governance in the age of large-scale data-driven language technology

Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Isaac Johnson, Gerard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Dragomir Radev, Aaron Gokaslan, Somaieh Nikpoor, Peter Henderson, Rishi Bommasani, and Margaret Mitchell. Data governance in the age of large-s...

work page doi:10.1145/3531146.3534637 2022
[50]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv, January 2020. URL http://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020
[51]

Amazon sagemaker model parallelism: A general and flexible framework for large model training

Can Karakus, Rahul Huilgol, Fei Wu, Anirudh Subramanian, Cade Daniel, Derya Cavdar, Teng Xu, Haohan Chen, Arash Rahnama, and Luis Quintela. Amazon SageMaker model parallelism: A general and flexible framework for large model training. arXiv preprint arXiv:2111.05972, 2021

work page arXiv 2021
[52]

Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 25...

work page doi:10.18653/v1/n18-1023 2018
[53]

Reducing activation recomputation in large transformer models

Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models, 2022. URL https://arxiv.org/abs/2205.05198

work page arXiv 2022
[54]

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1007. URL https://...

work page doi:10.18653/v1/p18-1007 2018
[55]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi:10....

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
[56]

RACE: Large-scale ReAding comprehension dataset from examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785--794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/D17-1082. URL htt...

work page doi:10.18653/v1/d17-1082 2017
[57]

Teven Le Scao, Thomas Wang, Daniel Hesslow, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julien Launay, and Iz Beltagy. What language model to train if you have one million GPU hours? In Findings of the Associati...

work page 2022
[58]

BioBERT: A pre-trained biomedical language representation model for biomedical text mining

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36:1234--1240, February 2020. ISSN 14602059. doi:10.1093/bioinformatics/btz682

work page doi:10.1093/bioinformatics/btz682 2020
[59]

Deduplicating training data makes language models better

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424--8445, Dublin, Ireland, May 2022a. Association for...

work page doi:10.18653/v1/2022.acl-long.577 2022
[60]

Evaluating human-language model interaction

Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael S. Bernstein, and Percy Liang. Evaluating human-language model interaction. CoRR, abs/2212.09746, 2022b. doi:10...

work page doi:10.48550/arxiv.2212.09746 2022
[61]

Do we still need clinical language models?

Eric Lehman, Evan Hernandez, Diwakar Mahajan, Jonas Wulff, Micah J. Smith, Zachary Ziegler, Daniel Nadler, Peter Szolovits, Alistair Johnson, and Emily Alsentzer. Do we still need clinical language models?, 2023. URL https://arxiv.org/abs/2302.08091

work page arXiv 2023
[62]

The Winograd schema challenge

Hector J. Levesque, Ernest Davis, and L. Morgenstern. The winograd schema challenge. In International Conference on Principles of Knowledge Representation and Reasoning, 2011

work page 2011
[63]

Limits to depth efficiencies of self-attention

Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. Limits to depth efficiencies of self-attention. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 22640--22651. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ff4...

work page 2020
[64]

Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art

Patrick Lewis, Myle Ott, Jingfei Du, and Veselin Stoyanov. Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 146--157, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.clinicalnl...

work page doi:10.18653/v1/2020.clinicalnlp-1.17 2020
[65]

Solving Quantitative Reasoning Problems with Language Models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858

work page internal anchor Pith review Pith/arXiv arXiv 2022
[66]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.09110 2022
[67]

Jurassic-1: Technical details and evaluation

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 1, 2021

work page 2021
[68]

Language models of protein sequences at the scale of evolution enable accurate structure prediction

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and Alexander Rives. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022. doi:10.1101/2022.07.20.500902. URL https://www.biorxiv.org/content/early/2022...

work page doi:10.1101/2022.07.20.500902 2022
[69]

Autoregressive structured prediction with language models

Tianyu Liu, Yuchen Eleanor Jiang, Nicholas Monath, Ryan Cotterell, and Mrinmaya Sachan. Autoregressive structured prediction with language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 993--1005, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org...

work page 2022
[70]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

work page 2019
[71]

BioGPT: generative pre-trained transformer for biomedical text generation and mining

Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), September 2022. doi:10.1093/bib/bbac409. URL https://doi.org/10.1093

work page doi:10.1093/bib/bbac409 2022
[72]

Exploring cross-sentence contexts for named entity recognition with BERT

Jouni Luoma and Sampo Pyysalo. Exploring cross-sentence contexts for named entity recognition with BERT. In Proceedings of the 28th International Conference on Computational Linguistics, pages 904--914, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi:10.18653/v1/2020.coling-main.78. URL https://aclantho...

work page doi:10.18653/v1/2020.coling-main.78 2020
[73]

WWW'18 open challenge: Financial opinion mining and question answering

Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. WWW'18 open challenge: Financial opinion mining and question answering. In Pierre-Antoine Champin, Fabien Gandon, Mounia Lalmas, and Panagiotis G. Ipeirotis, editors, Companion of the The Web Conference 2018 on The Web Conference 2018,...

work page doi:10.1145/3184558.3192301 2018
[74]

Good debt or bad debt: Detecting semantic orientations in economic texts

Pekka Malo, Ankur Sinha, Pekka J. Korhonen, Jyrki Wallenius, and Pyry Takala. Good debt or bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol., 65(4):782--796, 2014. doi:10.1002/asi.23062. URL https://doi.org/10.1002/asi.23062

work page doi:10.1002/asi.23062 2014
[75]

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, and Samson Tan. Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp, 2021. URL https://arxiv.org/abs/2112.10508

work page arXiv 2021
[76]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381--2391, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi:10.186...

work page doi:10.18653/v1/d18-1260 2018
[77]

Recurrent neural network based language model

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, pages 1045--1048. Makuhari, 2010

work page 2010
[78]

A corpus and cloze evaluation for deeper understanding of commonsense stories

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies ...

work page doi:10.18653/v1/n16-1098 2016
[79]

BERTweet: A pre-trained language model for English tweets

Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. BERTweet: A pre-trained language model for English tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 9--14, Online, October 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-demos.2. URL https://aclantho...

work page doi:10.18653/v1/2020.emnlp-demos.2 2020
[80]

Adversarial NLI: A New Benchmark for Natural Language Understanding

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885--4901, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main....

work page doi:10.18653/v1/2020.acl-main.441 2020
