BloombergGPT: A Large Language Model for Finance
Pith reviewed 2026-05-13 23:14 UTC · model grok-4.3
The pith
BloombergGPT, a 50 billion parameter model trained on financial plus general data, outperforms prior models on financial tasks while preserving general LLM performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BloombergGPT is a 50 billion parameter model trained on a combined corpus of 363 billion financial tokens and 345 billion general tokens; the resulting model exceeds existing models by substantial margins on financial benchmarks while matching performance on standard general-purpose LLM evaluations.
What carries the argument
The mixed financial-plus-general training corpus used to pretrain the 50 billion parameter transformer model.
If this is right
- Financial NLP tasks such as sentiment analysis and question answering become more accurate with the specialized model.
- The same mixed-dataset recipe can be applied to build other domain-specific models without sacrificing general capability.
- Releasing the training process details allows other groups to replicate or adapt the approach at similar scale.
Where Pith is reading between the lines
- The pattern may extend to other high-stakes domains where both specialized knowledge and general reasoning matter.
- Collecting hundreds of billions of domain tokens appears feasible for organizations with proprietary data pipelines.
- Public release of training logs sets a precedent for transparency that could influence future large-model projects.
Load-bearing premise
The chosen financial data sources and internal benchmarks accurately represent real financial usage and the observed gains arise from the training mix rather than from dataset artifacts or evaluation choices.
What would settle it
An independent evaluation on financial tasks drawn from sources outside the training corpus and the reported benchmarks would show whether the performance advantage holds.
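An independent evaluation of this kind is only informative if the new test items do not themselves overlap the training text. The sketch below is a minimal illustration of one possible overlap screen, not the paper's procedure: the function names `word_ngrams` and `contaminated_items`, the whitespace tokenization, and the default 8-word window are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokenized)."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens, giving an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(eval_items: Iterable[str],
                       training_docs: Iterable[str],
                       n: int = 8) -> List[str]:
    """Return the evaluation items that share at least one n-gram with the training docs."""
    seen: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        seen |= word_ngrams(doc, n)
    return [item for item in eval_items if word_ngrams(item, n) & seen]


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    training = ["the central bank raised interest rates by fifty basis points"]
    evaluation = [
        "Analysts said the central bank raised rates",
        "Shares fell after earnings missed estimates",
    ]
    print(contaminated_items(evaluation, training, n=3))
```

At the scale discussed here the training corpus would not fit in an in-memory set, so a real check would likely need a hashed or sampled index, and the window size would have to be tuned to the task.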
Original abstract
The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BloombergGPT, a 50 billion parameter language model trained on a mixed dataset of 363 billion financial tokens drawn from Bloomberg sources and 345 billion general-purpose tokens. It claims that this training regime produces a model that outperforms prior models on financial tasks by significant margins while preserving performance on standard general LLM benchmarks. Validation is reported across standard LLM benchmarks, open financial benchmarks, and a proprietary internal benchmark suite; the authors also document modeling choices, the training process, and evaluation methodology, and release Training Chronicles in Appendix C.
Significance. If the performance claims are substantiated, the work would constitute a notable contribution as the first reported large-scale domain-specific LLM for finance. The construction of what is described as one of the largest financial token datasets and the demonstration that mixed-domain training can improve financial-task performance without degrading general capabilities would be of direct interest to both the NLP and FinTech communities. The release of training chronicles adds practical value for reproducibility.
major comments (2)
- [Evaluation] Evaluation section (and abstract): The headline claim that mixed training yields 'significant margins' on financial tasks rests primarily on results from the authors' internal benchmark suite, which the text states 'most accurately reflect our intended usage.' No task definitions, question sources, scoring rubrics, contamination checks, or exclusion criteria are supplied for these benchmarks. Because the largest reported gains are tied to these undisclosed evaluations, independent verification of the central empirical result is impossible and the risk of selection bias or metric-specific artifacts cannot be assessed.
- [Evaluation] § on open financial benchmarks: While the paper references validation on open financial benchmarks, the text supplies no numerical tables, baseline comparisons, or error bars for these results either. The absence of concrete numbers leaves the 'outperforms existing models' assertion without direct quantitative support in the manuscript.
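The error bars requested in these comments could be produced in several ways. As one hedged illustration, the sketch below computes a percentile-bootstrap interval over per-example scores. The function name `bootstrap_ci`, the resample count, and the choice of bootstrap rather than repeated evaluation runs are assumptions for the example; the manuscript does not say which method the authors would use.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(scores: Sequence[float],
                 n_resamples: int = 1000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over per-example scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample the examples with replacement and record the resampled mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, lower, upper


if __name__ == "__main__":
    # Hypothetical per-example correctness (1.0 = correct), for illustration only.
    example_scores = [1.0, 0.0] * 50
    mean, low, high = bootstrap_ci(example_scores)
    print(f"accuracy = {mean:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

An interval of this kind reflects sampling variation in the test set only; it does not capture variation across training seeds or prompt formats, which would need separate runs.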
minor comments (2)
- [Abstract] Abstract: The abstract asserts benchmark outperformance but supplies no numerical results, error bars, baseline details, or exclusion criteria, leaving the central claim with limited direct support from the provided text.
- [Appendix C] Appendix C (Training Chronicles): Confirm that the released training log includes sufficient hyper-parameter schedules, hardware details, and any observed instabilities so that the training narrative can be followed by readers.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript on BloombergGPT. The comments on the evaluation sections are well-taken, and we address each point below with clarifications and commitments to revisions where feasible while respecting necessary constraints on proprietary information.
Point-by-point responses
- Referee: [Evaluation] Evaluation section (and abstract): The headline claim that mixed training yields 'significant margins' on financial tasks rests primarily on results from the authors' internal benchmark suite, which the text states 'most accurately reflect our intended usage.' No task definitions, question sources, scoring rubrics, contamination checks, or exclusion criteria are supplied for these benchmarks. Because the largest reported gains are tied to these undisclosed evaluations, independent verification of the central empirical result is impossible and the risk of selection bias or metric-specific artifacts cannot be assessed.
  Authors: We appreciate the referee's emphasis on transparency for the internal benchmarks. These evaluations are constructed from Bloomberg's proprietary data and use cases to best reflect real-world financial applications, which is why full task definitions, question sources, and specific rubrics cannot be disclosed without violating confidentiality. We will revise the manuscript to provide expanded high-level descriptions of task categories (e.g., financial sentiment, report summarization, entity extraction), general scoring methodologies, and contamination mitigation steps that do not reveal sensitive details. This will better contextualize the results and address concerns about selection bias while preserving the proprietary nature of the suite.
  Revision: partial
- Referee: [Evaluation] § on open financial benchmarks: While the paper references validation on open financial benchmarks, the text supplies no numerical tables, baseline comparisons, or error bars for these results either. The absence of concrete numbers leaves the 'outperforms existing models' assertion without direct quantitative support in the manuscript.
  Authors: We agree that the open financial benchmark results should be presented with explicit quantitative support in the main text. The evaluation section includes these comparisons, but to improve clarity and address the concern directly, we will add a dedicated summary table reporting numerical performance metrics on the open benchmarks (including baselines from prior models), along with error bars from multiple evaluation runs where applicable. This revision will provide the direct quantitative evidence requested.
  Revision: yes
- Not resolved: full release of proprietary internal benchmark task definitions, question sources, and specific instances, withheld due to confidentiality and data protection requirements.
Circularity Check
No circularity: empirical training and benchmark evaluation
full rationale
The paper reports construction of a mixed financial+general token dataset, training of a 50B model, and empirical evaluation on standard LLM benchmarks, open financial benchmarks, and internal suites. No equations, derivations, or first-principles claims are present that reduce to self-defined quantities, fitted parameters renamed as predictions, or self-citation chains. Performance margins are reported outcomes of training and testing rather than tautological restatements of inputs. The analysis criteria for circularity (self-definitional, fitted-input-as-prediction, load-bearing self-citation, etc.) are not met.
Axiom & Free-Parameter Ledger
free parameters (2)
- Model parameter count (50 billion)
- Financial-to-general token ratio (363B:345B)
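The two free parameters above imply a few derived quantities that are easy to check by hand. The short sketch below works through that arithmetic using only the figures quoted on this page (363 billion financial tokens, 345 billion general tokens, 50 billion parameters). The function name `corpus_mix` is hypothetical, and the tokens-per-parameter figure assumes the whole corpus is seen exactly once, which this page does not state.

```python
from typing import Dict


def corpus_mix(financial_tokens: int, general_tokens: int, parameters: int) -> Dict[str, float]:
    """Derive the total size, financial share, and tokens per parameter of a mixed corpus."""
    total = financial_tokens + general_tokens
    return {
        "total_tokens": total,
        "financial_share": financial_tokens / total,
        # Assumption: one full pass over the corpus; the actual training budget may differ.
        "tokens_per_parameter": total / parameters,
    }


if __name__ == "__main__":
    mix = corpus_mix(363_000_000_000, 345_000_000_000, 50_000_000_000)
    print(f"Total tokens: {mix['total_tokens'] / 1e9:.0f}B")            # 708B
    print(f"Financial share: {mix['financial_share']:.1%}")             # 51.3%
    print(f"Tokens per parameter: {mix['tokens_per_parameter']:.2f}")   # 14.16
```

The mix is therefore close to an even split, with financial text making up slightly more than half of the available tokens.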
axioms (1)
- domain assumption: Standard transformer pretraining on next-token prediction transfers effectively to financial text when mixed with general data.
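For reference, the next-token prediction objective named in this axiom is conventionally written as an autoregressive negative log-likelihood. The notation below is introduced here for illustration and is not taken from the paper: $x_1, \dots, x_T$ is a token sequence, $x_{<t}$ is the prefix before position $t$, and $p_\theta$ is the model's predictive distribution with parameters $\theta$.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Under this reading, the axiom amounts to assuming that minimizing the same loss over a mixture of financial and general sequences yields representations useful for both.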
Forward citations
Cited by 28 Pith papers
- PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
  Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
- MeMo: Memory as a Model
  MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
- AutoRedTrader: Autonomous Red Teaming of Trading Agents through Synthetic Misinformation Injection
  AutoRedTrader generates synthetic financial misinformation via behavioral bias manipulation and agent feedback to red-team LLM trading agents, reaching 69% exposure and 26.67% attack success on Bitcoin data simulations.
- From Hypotheses to Factors: Constrained LLM Agents in Cryptocurrency Markets
  Constrained LLM agents discover cryptocurrency factors that produce a portfolio with 44.55% annualized return and Sharpe ratio of 1.55 in pure out-of-sample 2024-2026 testing after trading costs.
- Detecting Corporate AI-Washing via Cross-Modal Semantic Inconsistency Learning
  AWASH detects AI-washing via cross-modal inconsistency reasoning on a new trimodal benchmark of 88k corporate disclosure triplets, achieving F1 0.882 with a CMID network that grounds claims against patents and hiring data.
- Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
  Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
- Agentic Retrieval-Augmented Generation for Financial Document Question Answering
  FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...
- Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls
  Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
- RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
  RouteNLP is a closed-loop LLM routing framework using conformal cascading and distillation co-optimization that cut inference costs by 58% in an 8-week enterprise deployment while preserving 91% acceptance and high qu...
- Cross-Stock Predictability via LLM-Augmented Semantic Networks
  LLM filtering of embedding-based stock networks raises long-short Sharpe ratio from 0.742 to 0.820 and cuts max drawdown from -10.47% to -7.85% in 2011-2019 S&P 500 backtests.
- QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
  QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
- MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model
  MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.
- SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning
  SenseAI is a human-in-the-loop financial sentiment dataset with reasoning processes and market outcomes that reveals predictable LLM error patterns like Latent Reasoning Drift for RLHF alignment.
- SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
  SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
- PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage
  PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.
- CGCMA: Conditionally-Gated Cross-Modal Attention for Event-Conditioned Asynchronous Fusion
  CGCMA separates text-conditioned grounding from lag-aware trust gating to fuse asynchronous price and web data, yielding the highest Sharpe ratio of +0.449 on a new crypto news corpus.
- Jailbreaking Black Box Large Language Models in Twenty Queries
  PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
  SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
- Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics
  SSAI maps news into four factors (sentiment, risk, confidence, volatility) for trading, but factor portfolios, ridge models, and RL agents show no reliable edge over baselines after coverage controls and costs.
- FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification
  FinGround reduces financial hallucinations by 68% over baselines in retrieval-equalized tests through atomic claim verification and grounding, with an 8B model retaining 91.4% F1 at low cost.
- When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies
  LLM features optimized for high information coefficient with returns do not reliably improve PPO trading policies under distribution shifts, where price-only or macro baselines remain more robust.
- PRAGMA: Revolut Foundation Model
  PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and ...
- CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization
  CROP achieves 80.6% token reduction on GSM8K, LogiQA and BIG-Bench Hard with only nominal accuracy decline by regularizing automatic prompt optimization with response-length feedback.
- FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures
  FinReporting builds a canonical ontology for income, balance, and cash flow statements and uses constrained LLM agents as verifiers to produce localized, auditable reports from US, Japanese, and Chinese filings.
- A Multi-Agent Orchestration Framework for Venture Capital Due Diligence
  A multi-agent orchestration framework automates VC due diligence using LLMs, web retrieval, and a programmatic pipeline to extract and parse official Greek business registry filings while flagging data gaps.
- AgenticAITA: A Proof-Of-Concept About Deliberative Multi-Agent Reasoning for Autonomous Trading Systems
  AgenticAITA proposes a training-free multi-agent LLM framework for autonomous trading using a deliberative pipeline, Z-score triggers, and safety gates, shown to run correctly in a five-day live dry-run with 157 invocations.
- ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection
  ComplianceNLP integrates knowledge-graph-augmented RAG, multi-task legal text extraction, and gap analysis to detect regulatory compliance gaps, reporting 87.7 F1 and real-world efficiency gains over GPT-4o baselines.
- Developing an ESG-Oriented Large Language Model through ESG Practices
  ESG-adapted versions of Qwen-3-4B using LoRA and IRM outperform the base model and Llama-3/Gemma-3 baselines on generative ESG question-answering tasks.
Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiV, 4 2019. URL http://arxiv.org/abs/1904.05342
-
[48]
Continuous speech recognition by statistical methods
Frederick Jelinek. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64 0 (4): 0 532--556, 1976
work page 1976
-
[49]
Data governance in the age of large-scale data-driven language technology
Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Isaac Johnson, Gerard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Dragomir Radev, Aaron Gokaslan, Somaieh Nikpoor, Peter Henderson, Rishi Bommasani, and Margaret Mitchell. Data governance in the age of large-s...
-
[50]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiV, 1 2020. URL http://arxiv.org/abs/2001.08361
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[51]
Amazon sagemaker model parallelism: A general and flexible framework for large model training
Can Karakus, Rahul Huilgol, Fei Wu, Anirudh Subramanian, Cade Daniel, Derya Cavdar, Teng Xu, Haohan Chen, Arash Rahnama, and Luis Quintela. Amazon sagemaker model parallelism: A general and flexible framework for large model training. arXiV preprint arXiV:2111.05972, 2021
-
[52]
Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages 25...
-
[53]
Reducing activation recomputation in large transformer models, 2022
Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models, 2022. URL https://arxiv.org/abs/2205.05198
-
[54]
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1007. URL https://...
-
[55]
Taku Kudo and John Richardson. S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66--71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi:10....
work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
-
[56]
RACE : Large-scale R e A ding comprehension dataset from examinations
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE : Large-scale R e A ding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785--794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/D17-1082. URL htt...
-
[57]
Teven Le Scao, Thomas Wang, Daniel Hesslow, Stas Bekman, M Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julien Launay, and Iz Beltagy. What language model to train if you have one million GPU hours? In Findings of the Associati...
work page 2022
-
[58]
Biobert: A pre-trained biomedical language representation model for biomedical text mining
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36: 0 1234--1240, 2 2020. ISSN 14602059. doi:10.1093/bioinformatics/btz682
-
[59]
Deduplicating training data makes language models better
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424--8445, Dublin, Ireland, May 2022 a . Association for...
-
[60]
Evaluating human-language model interaction
Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard - Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael S. Bernstein, and Percy Liang. Evaluating human-language model interaction. CoRR, abs/2212.09746, 2022 b . doi:10...
-
[61]
Smith, Zachary Ziegler, Daniel Nadler, Peter Szolovits, Alistair Johnson, and Emily Alsentzer
Eric Lehman, Evan Hernandez, Diwakar Mahajan, Jonas Wulff, Micah J. Smith, Zachary Ziegler, Daniel Nadler, Peter Szolovits, Alistair Johnson, and Emily Alsentzer. Do we still need clinical language models?, 2023. URL https://arxiv.org/abs/2302.08091
-
[62]
Hector J. Levesque, Ernest Davis, and L. Morgenstern. The winograd schema challenge. In International Conference on Principles of Knowledge Representation and Reasoning, 2011
work page 2011
-
[63]
Limits to depth efficiencies of self-attention
Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, and Amnon Shashua. Limits to depth efficiencies of self-attention. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 22640--22651. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ff4...
work page 2020
-
[64]
Patrick Lewis, Myle Ott, Jingfei Du, and Veselin Stoyanov. Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 146--157, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.clinicalnl...
-
[65]
Solving Quantitative Reasoning Problems with Language Models
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[66]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher R \' e , Diana Acosta - Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong,...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.09110 2022
-
[67]
Jurassic-1: Technical details and evaluation
Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 1, 2021
work page 2021
-
[68]
Language models of protein sequences at the scale of evolution enable accurate structure prediction
Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and Alexander Rives. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022. doi:10.1101/2022.07.20.500902. URL https://www.biorxiv.org/content/early/2022...
-
[69]
Autoregressive structured prediction with language models
Tianyu Liu, Yuchen Eleanor Jiang, Nicholas Monath, Ryan Cotterell, and Mrinmaya Sachan. Autoregressive structured prediction with language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 993--1005, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org...
work page 2022
-
[70]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7
work page 2019
-
[71]
Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT : generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23 0 (6), sep 2022. doi:10.1093/bib/bbac409. URL https://doi.org/10.1093
-
[72]
Exploring cross-sentence contexts for named entity recognition with BERT
Jouni Luoma and Sampo Pyysalo. Exploring cross-sentence contexts for named entity recognition with BERT . In Proceedings of the 28th International Conference on Computational Linguistics, pages 904--914, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi:10.18653/v1/2020.coling-main.78. URL https://aclantho...
-
[73]
Www'18 open challenge: Financial opinion mining and question answering
Macedo Maia, Siegfried Handschuh, Andr \' e Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. Www'18 open challenge: Financial opinion mining and question answering. In Pierre - Antoine Champin, Fabien Gandon, Mounia Lalmas, and Panagiotis G. Ipeirotis, editors, Companion of the The Web Conference 2018 on The Web Conference 2018,...
-
[74]
Korhonen, Jyrki Wallenius, and Pyry Takala
Pekka Malo, Ankur Sinha, Pekka J. Korhonen, Jyrki Wallenius, and Pyry Takala. Good debt or bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol., 65 0 (4): 0 782--796, 2014. doi:10.1002/asi.23062. URL https://doi.org/10.1002/asi.23062
-
[75]
Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP,
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, and Samson Tan. Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp, 2021. URL https://arxiv.org/abs/2112.10508
-
[76]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381--2391, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi:10.186...
-
[77]
Recurrent neural network based language model
Tomas Mikolov, Martin Karafi \'a t, Lukas Burget, Jan Cernock \`y , and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, pages 1045--1048. Makuhari, 2010
work page 2010
-
[78]
A corpus and cloze evaluation for deeper understanding of commonsense stories
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies ...
-
[79]
BERT weet: A pre-trained language model for E nglish tweets
Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. BERT weet: A pre-trained language model for E nglish tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 9--14, Online, October 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-demos.2. URL https://aclantho...
-
[80]
Adversarial NLI : A New Benchmark for Natural Language Understanding
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI : A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885--4901, Online, July 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main....