RoBERTa: A Robustly Optimized BERT Pretraining Approach

Danqi Chen, Jingfei Du, Mandar Joshi, Myle Ott, Naman Goyal, Yinhan Liu · 2019 · cs.CL · arXiv 1907.11692

Mixed citation behavior. Most common role is background (65%).

459 Pith papers citing it

Background 65% of classified citations

abstract

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

citation-role summary

background: 48 · method: 12 · baseline: 5 · dataset: 3

citation-polarity summary

background: 44 · use method: 12 · baseline: 5 · support: 3 · use dataset: 3 · unclear: 1
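
The 65% background share quoted near the top of this page appears consistent with the polarity tallies above (44 of 68 classified citations) rather than with the role tallies (48 of 68, roughly 71%). The following is a minimal Python sketch of that arithmetic. It assumes the percentage is simply the background count divided by the sum of the listed counts; the `shares` helper and the variable names are illustrative and are not part of the hub's tooling.

```python
# Polarity tallies copied from the citation-polarity summary above.
polarity_counts = {
    "background": 44,
    "use method": 12,
    "baseline": 5,
    "support": 3,
    "use dataset": 3,
    "unclear": 1,
}


def shares(counts):
    """Return each label's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


polarity_shares = shares(polarity_counts)
print(sum(polarity_counts.values()))  # 68 classified citations
print(polarity_shares["background"])  # 65
```

Only 68 of the 459 citing papers contribute to these tallies, so the percentage describes the classified subset, not the full citing set.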

authors

Danqi Chen, Jingfei Du, Mandar Joshi, Myle Ott, Naman Goyal, Yinhan Liu

representative citing papers

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

cs.CL · 2026-06-01 · conditional · novelty 8.0

FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

cs.IR · 2024-03-06 · unverdicted · novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

cs.CL · 2022-02-25 · accept · novelty 8.0

Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.

SimCSE: Simple Contrastive Learning of Sentence Embeddings

cs.CL · 2021-04-18 · conditional · novelty 8.0

SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

cs.CL · 2020-03-23 · conditional · novelty 8.0

ELECTRA replaces masked language modeling with replaced token detection, yielding contextual representations that outperform BERT at equal compute and match larger models like RoBERTa with far less compute.

REALM: Retrieval-Augmented Language Model Pre-Training

cs.CL · 2020-02-10 · accept · novelty 8.0

REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

cs.CL · 2019-08-27 · unverdicted · novelty 8.0

Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matching BERT accuracy.

Probing Memorization of Tabular In-Context Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

A new probing framework detects moderate parametric memorization signals in tabular in-context learning models under single-task fine-tuning, strongest on low-cardinality tasks, but signals largely disappear under realistic training.

FlexTab: A Flexible Encoder-Decoder Architecture for In-Context Learning Across Diverse Tabular Tasks

cs.LG · 2026-06-29 · unverdicted · novelty 7.0 · 2 refs

FlexTab shows a shared encoder with task-specific decoders trained on unlabeled tables can achieve SOTA on classification, regression, anomaly detection and entity matching while staying competitive on relational entity classification.

PromptGNN-sim: Deep Fusion and Alignment of GNN and LLMs for Text-Attributed Graph Learning

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PromptGNN-sim uses GAT-based semantically aware neighborhood selection and structure-aware LLM prompts with bi-directional contrastive alignment to outperform prior GNN, LLM, and fusion methods on text-attributed graph datasets.

Anisotropy Decides Cosine vs. Rank Metrics for Text Embeddings

cs.CL · 2026-06-28 · conditional · novelty 7.0

Anisotropy, quantified by dominant-dimension variance fraction, determines the best parameter-free similarity metric for text embeddings, with rank-based metrics gaining ~20% relative where cosine is weakest.

Continuous Language Diffusion as a Decoder-Interface Problem

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

DEPO formulates detector-evasive paraphrasing as a constrained MDP and solves it via Lagrangian primal-dual RL with GRPO-style updates to achieve evasion while satisfying a semantic-preservation constraint.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

GRUFF dataset shows LLMs agree well with masculine and feminine German pronouns but fail on neopronouns and distractors, with occupational stereotypes poorly correlated across cases.

Towards Cost-effective LLMs Routing with Batch Prompting

cs.DB · 2026-05-27 · unverdicted · novelty 7.0

RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.

Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin

cs.CR · 2026-05-22 · unverdicted · novelty 7.0

An RL-guided MCTS proof search for Tamarin finds more and shorter proofs than standard search across 16 protocol models.

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

cs.CL · 2026-05-19 · conditional · novelty 7.0

Different scoring mechanisms cause encoder-based authorship attribution models to consolidate authorship signals at different layers, as shown by causal interventions and gradient analysis.

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.

Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model

stat.ML · 2026-05-14 · accept · novelty 7.0

A solvable hierarchical model with power-law feature strengths yields explicit power-law scaling of prediction error through sequential recovery of latent directions by a layer-wise spectral algorithm.

citing papers explorer

Showing 30 of 30 citing papers after filters.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders
cs.IR · 2024-03-06 · unverdicted · none · ref 25 · internal anchor
BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Tulu 3: Pushing Frontiers in Open Language Model Post-Training
cs.CL · 2024-11-22 · accept · none · ref 2 · internal anchor
Tulu 3 provides open SOTA post-trained LLMs with a novel RLVR algorithm and complete reproducibility artifacts that surpass Llama 3.1 instruct, Qwen 2.5, Mistral, GPT-4o-mini, and Claude 3.5-Haiku on benchmarks.

Power-Softmax: Towards Secure LLM Inference over Encrypted Data
cs.LG · 2024-10-12 · unverdicted · none · ref 29 · internal anchor
Power-Softmax is a new HE-compatible attention variant that permits training and inference of billion-parameter polynomial LLMs with performance matching standard transformers.

Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
cs.CL · 2024-08-02 · unverdicted · none · ref 29 · internal anchor
Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of initialization seed on two model architectures.

FinTruthQA: A Benchmark for AI-Driven Financial Disclosure Quality Assessment in Investor-Firm Interactions
cs.CL · 2024-06-17 · unverdicted · none · ref 22 · internal anchor
Introduces FinTruthQA, a 6,000-entry annotated benchmark for AI assessment of financial disclosure quality across four criteria, with model evaluations showing strong results on question tasks but weaker on answer relevance.

Assessing How Hate, Counterspeech, and Toxicity Affect Hate Group Newcomers
cs.CY · 2024-05-28 · unverdicted · none · ref 41 · internal anchor
Counterspeech reduces the likelihood that hate-speech-using newcomers continue posting in hate subreddits, though toxic counterspeech raises the chance of continued hostility in the thread.

Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
cs.CL · 2024-04-29 · unverdicted · none · ref 5 · internal anchor
Holmes is a probing benchmark compiling over 200 datasets from 270 studies to evaluate linguistic competence across syntax, morphology, semantics, reasoning, and discourse in more than 50 language models.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
cs.CL · 2024-02-05 · unverdicted · none · ref 51 · internal anchor
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation
cs.LG · 2024-12-26 · unverdicted · none · ref 23 · internal anchor
SyMerge merges models via single-layer adaptation and expert-guided self-labeling to achieve task synergy, reporting SOTA results on vision, dense prediction, and NLP tasks.

GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization
cs.CL · 2024-10-31 · unverdicted · none · ref 46 · internal anchor
GigaCheck detects LLM-generated text at both document and span levels by combining fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.

Secret Leak Detection in Software Issue Reports using LLMs: A Comprehensive Evaluation
cs.SE · 2024-10-31 · accept · none · ref 23 · internal anchor
Creates a 54k-instance benchmark of GitHub issue secrets and shows fine-tuned LLMs reach 94.49% F1 with 81.6% on 178 real repositories.

Conjuring Semantic Similarity
cs.AI · 2024-10-21 · unverdicted · none · ref 17 · internal anchor
Semantic similarity between texts is measured by the Jeffreys divergence between the image distributions induced by conditioning a diffusion model on each text, computed via Monte-Carlo sampling of the reverse-time SDEs.

A systematic framework for generating novel experimental hypotheses from language models
cs.CL · 2024-08-09 · unverdicted · none · ref 72 · internal anchor
A framework using language models to simulate non-existent experiments and derive novel testable hypotheses on dative verb acquisition and cross-structural generalization in children.

Retrieval-Augmented Generation for Natural Language Processing: A Survey
cs.CL · 2024-07-18 · accept · none · ref 117 · internal anchor
The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.

animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics
cs.SD · 2024-06-03 · unverdicted · none · ref 50 · internal anchor
Introduces animal2vec, a self-supervised transformer for sparse bioacoustic audio, and the MeerKAT meerkat vocalization dataset, claiming outperformance over baselines including in few-shot settings.

A Survey on Vision-Language-Action Models for Embodied AI
cs.RO · 2024-05-23 · unverdicted · none · ref 282 · internal anchor
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews
cs.CL · 2024-03-11 · unverdicted · none · ref 9 · internal anchor
A maximum likelihood model estimates 6.5-16.9% of peer-review text at ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023 was substantially modified by LLMs, with elevated rates in low-confidence and deadline-close submissions.

Retrieval-Augmented Generation with Graphs (GraphRAG)
cs.IR · 2024-12-31 · unverdicted · none · ref 263 · internal anchor
A survey proposing a holistic GraphRAG framework with components including query processor, retriever, organizer, generator, and data source, plus domain-tailored reviews, challenges, and future directions.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
cs.CL · 2024-12-18 · unverdicted · none · ref 165 · internal anchor
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning
cs.CL · 2024-10-17 · unverdicted · none · ref 15 · internal anchor
AdaSwitch improves small local LLM performance on reasoning tasks by adaptively switching to a large cloud LLM upon detected errors, sometimes matching cloud results with far less overhead.

The Platonic Representation Hypothesis
cs.LG · 2024-05-13 · unverdicted · none · ref 276 · internal anchor
Representations learned by large AI models are converging toward a shared statistical model of reality.

Large Language Model-Brained GUI Agents: A Survey
cs.AI · 2024-11-27 · unverdicted · none · ref 84 · internal anchor
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

HeartBERT: A Self-Supervised ECG Embedding Model for Efficient and Effective Medical Signal Analysis
eess.SP · 2024-11-08 · unverdicted · none · ref 17 · internal anchor
HeartBERT applies self-supervised pretraining on a RoBERTa architecture to ECG signals, producing embeddings that enable strong performance on sleep staging and heartbeat classification with smaller labeled datasets and fewer parameters than baselines.

Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
cs.SE · 2024-10-29 · unverdicted · none · ref 69 · internal anchor
Fine-tuned decoder-only LLMs achieve up to 40.4% higher MAP than UniXcoder on CoSQA+ for code search, with non-monotonic size scaling and data composition sensitivity.

Recent Advances in Multimodal Affective Computing: An NLP Perspective
cs.CL · 2024-09-11 · unverdicted · none · ref 217 · internal anchor
Survey organizing multimodal affective computing research around four NLP tasks, method paradigms, datasets, evaluation protocols, and future directions while releasing a resource repository.

SleepNet and DreamNet: Enriching and Reconstructing Representations for Consolidated Visual Classification
cs.LG · 2024-09-03 · unverdicted · none · ref 28 · internal anchor
SleepNet and DreamNet enrich visual features via supervised pre-trained encoders and reconstruct hidden states with encoder-decoder frameworks to outperform prior state-of-the-art classifiers.

Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts
cs.AI · 2024-03-06 · unverdicted · none · ref 37 · internal anchor
AI and NLP applied to educational artifacts within the Instructional Core Framework can identify advantages for teacher coaching, student support, and personalized learning.

Large Language Models: A Survey
cs.CL · 2024-02-09 · accept · none · ref 25 · internal anchor
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook
cs.LG · 2024-06-10 · unverdicted · none · ref 81 · internal anchor
A literature survey reviewing traditional diagnostics, AI-driven studies, and explainable AI models for mental disorder detection via online social media, including datasets, evaluation practices, issues, and future directions.

Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees
cs.CL · 2024-10-21 · unreviewed · ref 20 · internal anchor
