BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer · 2020 · DOI 10.18653/v1/2020.acl-main.703

42 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Locating and Editing Factual Associations in GPT

cs.CL · 2022-02-10 · accept · novelty 8.0

Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.

SimCSE: Simple Contrastive Learning of Sentence Embeddings

cs.CL · 2021-04-18 · conditional · novelty 8.0

SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.

Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.

A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

EVENT5Ws is a new large-scale, manually verified open-domain event extraction dataset that benchmarks LLMs and demonstrates cross-context generalization.

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

cs.CL · 2025-05-24 · unverdicted · novelty 7.0

Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

The Power of Scale for Parameter-Efficient Prompt Tuning

cs.CL · 2021-04-18 · unverdicted · novelty 7.0

Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.

Prefix-Tuning: Optimizing Continuous Prompts for Generation

cs.CL · 2021-01-01 · conditional · novelty 7.0

Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.

Longformer: The Long-Document Transformer

cs.CL · 2020-04-10 · accept · novelty 7.0

Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.

Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models

cs.CL · 2026-06-19 · unverdicted · novelty 6.0

LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.

ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

ZAS-SQL distills rules from zero-shot Text-to-SQL failures to reach 87.2-88.6% execution accuracy on Spider, new zero-shot SOTA surpassing some GPT-4 few-shot and fine-tuned baselines.

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.

Towards Understanding Self-Pretraining for Sequence Classification

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.

Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

cs.CR · 2026-04-23 · unverdicted · novelty 6.0

Hybrid DP with LLM or NER preprocessing significantly improves the privacy-utility trade-off for Dutch clinical note de-identification compared to standalone DP.

SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.

Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths

cs.IR · 2024-11-08 · unverdicted · novelty 6.0

HyPE improves generative retrieval by first generating hierarchical category paths for explainability and then using path-aware ranking to boost performance.

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

cs.CL · 2023-05-23 · conditional · novelty 6.0

UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

cs.CL · 2023-05-03 · conditional · novelty 6.0

Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.

citing papers explorer

Showing 42 of 42 citing papers.

Locating and Editing Factual Associations in GPT cs.CL · 2022-02-10 · accept · none · ref 25
Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
SimCSE: Simple Contrastive Learning of Sentence Embeddings cs.CL · 2021-04-18 · conditional · none · ref 133
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models cs.CL · 2026-05-02 · unverdicted · none · ref 54
S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis cs.CL · 2026-05-02 · unverdicted · none · ref 33
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents cs.CL · 2026-04-23 · unverdicted · none · ref 61
EVENT5Ws is a new large-scale, manually verified open-domain event extraction dataset that benchmarks LLMs and demonstrates cross-context generalization.
Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation cs.CL · 2025-05-24 · unverdicted · none · ref 28
Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024-06-14 · conditional · none · ref 217
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 223
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
The Power of Scale for Parameter-Efficient Prompt Tuning cs.CL · 2021-04-18 · unverdicted · none · ref 28
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
Prefix-Tuning: Optimizing Continuous Prompts for Generation cs.CL · 2021-01-01 · conditional · none · ref 65
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
Longformer: The Long-Document Transformer cs.CL · 2020-04-10 · accept · none · ref 103
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models cs.CL · 2026-06-19 · unverdicted · none · ref 115
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL cs.CL · 2026-06-06 · unverdicted · none · ref 37
ZAS-SQL distills rules from zero-shot Text-to-SQL failures to reach 87.2-88.6% execution accuracy on Spider, new zero-shot SOTA surpassing some GPT-4 few-shot and fine-tuned baselines.
Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws cs.LG · 2026-06-05 · unverdicted · none · ref 5
MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.
Towards Understanding Self-Pretraining for Sequence Classification cs.LG · 2026-05-20 · unverdicted · none · ref 79
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization cs.CL · 2026-05-17 · unverdicted · none · ref 17
Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents cs.CL · 2026-05-14 · unverdicted · none · ref 152
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing cs.CL · 2026-05-13 · unverdicted · none · ref 27
Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.
Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation cs.CR · 2026-04-23 · unverdicted · none · ref 10
Hybrid DP with LLM or NER preprocessing significantly improves the privacy-utility trade-off for Dutch clinical note de-identification compared to standalone DP.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization cs.CL · 2026-04-21 · unverdicted · none · ref 15
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths cs.IR · 2024-11-08 · unverdicted · none · ref 18
HyPE improves generative retrieval by first generating hierarchical category paths for explainability and then using path-aware ranking to boost performance.
Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 168
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations cs.CL · 2023-05-23 · conditional · none · ref 65
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes cs.CL · 2023-05-03 · conditional · none · ref 14
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning cs.AI · 2023-01-31 · conditional · none · ref 30
The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.
Efficient Training of Language Models to Fill in the Middle cs.CL · 2022-07-28 · unverdicted · none · ref 122
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation cs.CL · 2021-09-02 · conditional · none · ref 74
CodeT5 adds identifier-aware pre-training and bimodal dual generation to a T5-style encoder-decoder, yielding better results on defect detection, clone detection, and code-to-text, text-to-code, and code-to-code tasks than prior encoder-only or decoder-only models.
ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation cs.CV · 2026-06-10 · unverdicted · none · ref 12
ECA introduces continual alignment with MoQ, FeDEx, and DR for exemplar-free incremental learning in open-ended image-to-text generation, evaluated on four new benchmarks showing reduced forgetting.
Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets cs.AI · 2026-06-07 · unverdicted · none · ref 33
Amortized optimization with policy gradients and graph knowledge selects informative word subsets to explain black-box DLM outputs.
Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding cs.CL · 2026-05-28 · unverdicted · none · ref 18
ConSUM reranks candidate summaries using MBR consensus and source-consistency metrics to improve factuality over standard generation or reranking baselines.
AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System cs.DC · 2026-05-22 · unverdicted · none · ref 21
AlignedServe uses prefix-aware batching, large CPU in-flight request pools, batch scheduling, and GPU-to-GPU KV prefetching to raise decoding throughput up to 1.98x and cut latency up to 7.4x versus prior serving systems.
Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks cs.DB · 2026-05-15 · unverdicted · none · ref 19
A BART-GraphSAGE hybrid achieves ROC-AUC 67.40 on one RelBench task, competitive with LightGBM but still behind specialized relational deep learning and foundation models.
Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets cs.CL · 2026-05-06 · unverdicted · none · ref 28
An evidence-based model generates queries from query-free datasets, yielding summaries with competitive ROUGE scores to those using original queries.
From 'Here' to 'There': Exploring Proximity Semantics in Multimodal Data Exploration cs.HC · 2026-05-04 · unverdicted · none · ref 59
A user study with 20 participants shows that closeness between sketches, annotations, and language in a shared space helps disambiguate multimodal queries, leading to the concept of proximity semantics for data exploration systems.
A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents cs.CL · 2026-04-23 · unverdicted · none · ref 217
MODEE is a multimodal system that integrates graphs with LLM embeddings to outperform prior open-domain event extraction methods on large datasets.
Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization cs.CL · 2026-04-19 · unverdicted · none · ref 18
A reasoning-distillation plus dual-reward GRPO method for multi-role dialogue summarization matches ROUGE and BERTScore baselines while improving factual faithfulness and preference alignment on CSDS and SAMSum.
Remember what you did so you know what to do next cs.CL · 2023-10-30 · unverdicted · none · ref 16
GPT-J with full action history achieves 3.5x improvement over RL in ScienceWorld and matches a two-stage system using 29x larger models.
Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model cs.CL · 2026-06-10 · unverdicted · none · ref 32
QLoRA-tuned Qwen3-8B is fine-tuned on synthetic Bangla-English data to semantically grade written answers, reporting RoRa 0.819 and human agreement rho 0.936.
Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit cs.CL · 2026-06-02 · unverdicted · none · ref 22
Fine-tuned RoBERTa achieves 0.62 macro-F1 on 900 Reddit comments, outperforming best zero-shot LLM at 0.50, with largest gap on detecting belief propagation.
Ideological discrepancy between publishers and news content is linked with audience engagement and consensus on Facebook cs.SI · 2026-05-02 · unverdicted · none · ref 48
Ideological discrepancy between publishers and news content on Facebook is associated with nonlinear declines in audience consensus at extremes of alignment and mismatch, plus higher toxicity under mismatch, during a Brazilian election.
A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs cs.CL · 2026-06-26 · unverdicted · none · ref 272
A tree-of-thoughts inspired hybrid extractive-abstractive LLM prompt yields better legal case judgment summaries than standard extractive or abstractive prompts.
TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding cs.CL · 2026-06-06 · unverdicted · none · ref 32
Presents TextEconomizer, a transformer-based encoder-decoder for lossy text compression claiming 5.39x ratio, near-perfect semantic quality via standard metrics, and 153x fewer parameters than comparables.
