Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
42 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
EVENT5Ws is a new large-scale, manually verified open-domain event extraction dataset that benchmarks LLMs and demonstrates cross-context generalization.
Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
ZAS-SQL distills rules from zero-shot Text-to-SQL failures to reach 87.2-88.6% execution accuracy on Spider, new zero-shot SOTA surpassing some GPT-4 few-shot and fine-tuned baselines.
MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Pre-trained encoder-decoder transformers fine-tuned for sequence-to-sequence constituent parsing outperform prior seq2seq models and compete with specialized parsers on continuous treebanks.
Hybrid DP with LLM or NER preprocessing significantly improves the privacy-utility trade-off for Dutch clinical note de-identification compared to standalone DP.
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
HyPE improves generative retrieval by first generating hierarchical category paths for explainability and then using path-aware ranking to boost performance.
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
citing papers explorer
-
Why These Documents? Explainable Generative Retrieval with Hierarchical Category Paths
HyPE improves generative retrieval by first generating hierarchical category paths for explainability and then using path-aware ranking to boost performance.