BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
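A minimal sketch of the recipe the abstract describes: a pre-trained BERT encoder with a single linear output layer added on top of the [CLS] representation, fine-tuned end to end. This is not the authors' original TensorFlow release; it assumes the Hugging Face transformers and PyTorch packages, and the checkpoint name, learning rate, and toy batch below are illustrative choices.

```python
# Minimal sketch (not the authors' code): fine-tune a pre-trained BERT encoder
# for sentence classification by adding just one output layer on the [CLS] token.
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)  # pre-trained bidirectional encoder
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)  # the one extra layer

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # final hidden state of the [CLS] token
        return self.classifier(cls)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(num_labels=2)

batch = tokenizer(["a great movie", "a dull movie"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# All BERT weights and the new output layer are updated jointly during fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

Other tasks reuse the same encoder and swap only the output head (e.g., a span-prediction layer for SQuAD), which is the sense in which the abstract claims no substantial task-specific architecture modifications are needed.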
This paper has not been read by Pith yet.
Forward citations
Cited by 60 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Learning to Unscramble Feynman Loop Integrals with SAILIR
A self-supervised transformer learns to unscramble Feynman integrals for online IBP reduction, delivering bounded memory use on complex two-loop topologies while matching Kira's speed on the hardest cases tested.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on standard benchmarks and surpasses GPT-4o on a reversal poem completion task.
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
-
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
BEIR is a heterogeneous zero-shot benchmark showing that BM25 is a robust baseline, that re-ranking and late-interaction models perform best on average at higher computational cost, and that dense and sparse models lag in generalization.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Reformer: The Efficient Transformer
Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.
-
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...
-
HellaSwag: Can a Machine Really Finish Your Sentence?
The HellaSwag dataset, built via adversarial filtering of distractors, shows that state-of-the-art models fail commonsense inference tasks that humans solve easily.
-
Passage Re-ranking with BERT
Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
-
Convergence of difference inclusions via a diameter criterion
A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.
-
BOOKMARKS: Efficient Active Storyline Memory for Role-playing
BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CoDAAR creates a unified discrete representation space for multimodal sequences by aligning modality-specific codebooks through index-level semantic consensus, enabling both specificity and cross-modal generalization.
-
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state...
-
Enhancing Healthcare Search Intent Recognition with Query Representation Learning and Session Context
Clustering-based query representations with a novel multi-intent loss and a concordance rate metric improve healthcare search intent classification on two real-world log datasets.
-
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
-
NeuralBench: A Unifying Framework to Benchmark NeuroAI Models
NeuralBench is a new benchmarking framework for neuroAI models on EEG data that finds foundation models only marginally outperform task-specific ones while many tasks like cognitive decoding stay highly challenging.
-
Flexible Routing via Uncertainty Decomposition
A router that decomposes uncertainty to flexibly route queries between cheap models and oracles while providing regret bounds and supporting abstention in classification tasks with multiple annotations.
-
Is She Even Relevant? When BERT Ignores Explicit Gender Cues
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Conditional generation of antibody sequences with classifier-guided germline-absorbing discrete diffusion
Germline-absorbing discrete diffusion uses the germline sequence as the absorbing state to reduce germline bias in antibody modeling, raising non-germline residue prediction accuracy from 26% to 46% and improving cond...
-
TCRTransBench: A Comprehensive Benchmark for Bidirectional TCR-Peptide Sequence Generation
TCRTransBench provides a new benchmark with bidirectional TCR-peptide generation tasks, a large validated dataset, and metrics to evaluate neural models for immunological sequence modeling.
-
Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties
A framework combining TOPPing source selection with a VACAI-Bowl dual-branch model yields a 54.62% average improvement in dependency parsing across 10 low-resource varieties.
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
-
MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports
MedStruct-S benchmark shows encoder-only models outperform larger decoder-only ones on key-conditioned QA from noisy OCR clinical reports, with fine-tuned large models winning only when scale is ignored.
-
Reconstructing conformal field theoretical compositions with Transformers
Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.
-
Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs
MEDS is a dataset of 28,000 LLM personas performing high-school math tasks alongside psychometric tests and cognitive networks that capture math anxiety, self-efficacy, and confidence to support safer AI tutors.
-
Identifying and Characterizing Semantic Clones of Solidity Functions
A code-and-comment analysis method detects semantic clones in Solidity functions with 59% overall precision (84% for same-name functions) and 97% recall on 300k contracts, plus LLM summaries for uncommented code.
-
OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
OmniVTG creates a new large-scale open-world VTG dataset using iterative concept-gap filling and timestamped captioning, paired with a three-stage self-correction CoT paradigm that yields SOTA zero-shot results on fou...
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Provably Secure Steganography Based on List Decoding
List decoding enables a provably secure steganography scheme with higher embedding capacity for LLMs via candidate sets and suffix matching.
-
Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider
The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.
-
Sparse Contrastive Learning for Content-Based Cold Item Recommendation
SEMCo uses sparse entmax contrastive learning for purely content-based cold-start item recommendation, outperforming standard methods in ranking accuracy.
-
SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates
LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.
-
LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset
LASQ is a new aspect-based sentiment quadruple extraction dataset for Uzbek and Uyghur, accompanied by a syntax-aware model that shows gains over baselines on the task.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
Kathleen performs byte-level text classification via recurrent oscillator banks, FFT wavetable encoding, and phase harmonics, matching pretrained baselines on standard benchmarks with 36% fewer parameters.
-
Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.
-
Graph Topology Information Enhanced Heterogeneous Graph Representation Learning
ToGRL learns high-quality graph structures from raw heterogeneous graphs via a two-stage topology extraction process and prompt tuning, outperforming prior methods on five datasets.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garm...
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Fine-tuning aligned LLMs compromises safety guardrails even with minimal adversarial examples or benign data, creating new risks not covered by existing inference-time protections.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
Steering Language Models With Activation Engineering
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
-
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Low-cost imprecise robots achieve 80-90% success on six fine bimanual manipulation tasks using imitation learning with a new Action Chunking with Transformers algorithm trained on only 10 minutes of demonstrations.