super hub Mixed citations

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Kenton Lee, Kristina Toutanova, Ming-Wei Chang · 2019 · Proceedings of the 2019 Conference of the North · DOI 10.18653/v1/n19-1423

Mixed citation behavior. Most common role is background (68%).

276 Pith papers citing it

6,639 external citations · Crossref

Background 68% of classified citations

open at publisher browse 276 citing papers more from Jacob Devlin

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 25 method 7 dataset 1 other 1

citation-polarity summary

background 23 use method 7 unclear 3 use dataset 1

claims ledger

background The retrieval system only manages to fetch informationabout Fleming's professional achievements in the discoveryof penicillin. However, the document does not provide informa-tion about his educational background, thus the model generates ahallucinatory answer. inappropriately activated, blindly retrieving inaccurate information and consequently leading to an undesirable response. Consequently, several studies [75, 204, 228, 378] have proposed to make a shift from passive retrieval to adaptive re

authors

Jacob Devlin Kenton Lee Kristina Toutanova Ming-Wei Chang

co-cited works

representative citing papers

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

cs.CL · 2026-06-01 · conditional · novelty 8.0

FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.

Reachability and asymptotics of Gaussian Transformer dynamics

cs.LG · 2026-05-29 · unverdicted · novelty 8.0

Gaussian distributions are invariant under the mean-field Transformer flow, reducing infinite-dimensional dynamics to a bilinear control system on mean and covariance with explicit reachability and stability results.

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

cs.AI · 2026-05-18 · accept · novelty 8.0

QSTRBench is a new benchmark evaluating LLMs on compositional reasoning, converse relations, and conceptual neighbourhoods across QSTR calculi including a newly published RCC-22 CN, showing models exceed chance but fail to achieve consistent correctness.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Locating and Editing Factual Associations in GPT

cs.CL · 2022-02-10 · accept · novelty 8.0

Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.

SimCSE: Simple Contrastive Learning of Sentence Embeddings

cs.CL · 2021-04-18 · conditional · novelty 8.0

SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.

SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.

Structure Before Collapse: Transient semantic geometry in next-token prediction

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

Semantic geometry emerges transiently early in next-token prediction training before collapsing to Neural Collapse symmetry in synthetic settings with latent semantic factors.

Understanding Parallel Samplers in Masked Diffusion via Random Walks on Graphs

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Graph random walks provide a verifiable sandbox for diagnosing parallel samplers in masked diffusion models, showing performance depends on graph structure and introducing a new exact bisection sampler.

How Does Research Evolve? Tracing Cross-Domain Trajectories in NLP, ML, and CV with Claim-Grounded Typed Citations

cs.CL · 2026-06-21 · unverdicted · novelty 7.0

SciTraj is the first claim-grounded typed citation graph with 32,559 papers and 573,126 edges across six relation types, plus a temporally split link-prediction benchmark.

OVIG: Optimistic Verification of AI Training Integrity via Gradient Signals

cs.CR · 2026-06-19 · unverdicted · novelty 7.0

OVIG introduces an optimistic gradient-based verification framework for outsourced AI post-training that uses stride-sampled interval checks against an honest-replay boundary to achieve 0% attack success rate with low overhead.

Structured Inference with Large Language Gibbs

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

Large Language Gibbs uses LLM next-token conditionals as MCMC transition operators for iterative resampling of structured variables, aiming to produce a stationary distribution that compromises across all local conditionals.

CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models

cs.LG · 2026-06-16 · conditional · novelty 7.0

CheckMIABench converts LLMs with intermediate checkpoints into clean MIA testbeds by using pre- and post-checkpoint training data from the same distribution and evaluates published attacks on Pythia and OLMo models while releasing an open-source library.

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

cs.AI · 2026-06-12 · unverdicted · novelty 7.0

Introduces applicability condition extraction for therapeutic drug-disease relations, creates first annotated dataset of 1,119 pairs, and proposes enhanced LoRA method outperforming baselines.

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

AfriSUD supplies new SUD-annotated dependency treebanks for nine Sub-Saharan African languages and demonstrates that existing models exhibit clear limitations on their syntax.

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.

Continuous Language Diffusion as a Decoder-Interface Problem

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

cs.DB · 2026-06-06 · unverdicted · novelty 7.0

An adaptive two-phase semantic filter using clustering then a hybrid proxy trained on LLM confidence achieves 1.6-2.0x speedup over prior methods at 90% accuracy on 10K document corpora.

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

Introduces EURO-5K dataset from 136 EU acts and benchmarks full fine-tuning vs QLoRA for BERT and LLM models on reporting obligation extraction, reporting 0.89 F1 with limited gains from legal pretraining except under parameter-efficient adaptation.

Learning Coherent Representations: A Topological Approach to Interpretability

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Introduces coherence as a topological constraint on representations and the Coh objective to enforce geometric clustering for interpretability in neural networks.

citing papers explorer

Showing 49 of 49 citing papers after filters.

Reachability and asymptotics of Gaussian Transformer dynamics cs.LG · 2026-05-29 · unverdicted · none · ref 23
Gaussian distributions are invariant under the mean-field Transformer flow, reducing infinite-dimensional dynamics to a bilinear control system on mean and covariance with explicit reachability and stability results.
Structure Before Collapse: Transient semantic geometry in next-token prediction cs.LG · 2026-06-25 · unverdicted · none · ref 195
Semantic geometry emerges transiently early in next-token prediction training before collapsing to Neural Collapse symmetry in synthetic settings with latent semantic factors.
Understanding Parallel Samplers in Masked Diffusion via Random Walks on Graphs cs.LG · 2026-06-22 · unverdicted · none · ref 11
Graph random walks provide a verifiable sandbox for diagnosing parallel samplers in masked diffusion models, showing performance depends on graph structure and introducing a new exact bisection sampler.
Structured Inference with Large Language Gibbs cs.LG · 2026-06-17 · unverdicted · none · ref 1
Large Language Gibbs uses LLM next-token conditionals as MCMC transition operators for iterative resampling of structured variables, aiming to produce a stationary distribution that compromises across all local conditionals.
CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models cs.LG · 2026-06-16 · conditional · none · ref 197
CheckMIABench converts LLMs with intermediate checkpoints into clean MIA testbeds by using pre- and post-checkpoint training data from the same distribution and evaluates published attacks on Pythia and OLMo models while releasing an open-source library.
Learning Coherent Representations: A Topological Approach to Interpretability cs.LG · 2026-06-01 · unverdicted · none · ref 29
Introduces coherence as a topological constraint on representations and the Coh objective to enforce geometric clustering for interpretability in neural networks.
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them cs.LG · 2026-05-29 · conditional · none · ref 44
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation cs.LG · 2026-05-18 · unverdicted · none · ref 68
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
TILT: Target-induced loss tilting under covariate shift cs.LG · 2026-05-14 · conditional · none · ref 82
TILT adds a target-data penalty on an auxiliary predictor component to induce effective importance weighting for unsupervised domain adaptation under covariate shift.
Neural Cluster First, Route Second: One-Shot Capacitated Vehicle Routing via Differentiable Optimal Transport cs.LG · 2026-05-10 · unverdicted · none · ref 7
Neural CFRS is a non-autoregressive one-shot framework for CVRP that uses entropic optimal transport for capacitated clustering and achieves competitive gaps on large instances.
SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data cs.LG · 2026-05-08 · unverdicted · none · ref 234
SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships and achieving SOTA results in most benchmarks without relying on augmentations.
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors cs.LG · 2026-04-21 · unverdicted · none · ref 206
NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization cs.LG · 2026-04-14 · unverdicted · none · ref 18
STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineering evaluations.
Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning cs.LG · 2026-04-07 · unverdicted · none · ref 14
Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.
Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models cs.LG · 2026-02-04 · unverdicted · none · ref 5
Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
Effective Model Pruning: Measure The Redundancy of Model Components cs.LG · 2025-09-30 · unverdicted · none · ref 4
EMP maps importance scores to effective sample size N_eff and prunes the lowest N - N_eff components, with a derived lower bound on retained effective mass and upper bound on loss increase.
Task complexity shapes internal representations and robustness in neural networks cs.LG · 2025-08-07 · unverdicted · none · ref 16
Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-precision and perturbed networks.
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations cs.LG · 2024-02-27 · unverdicted · none · ref 106
HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.
Spectral Retrieval-Augmented Time-Series Forecasting cs.LG · 2026-06-17 · unverdicted · none · ref 29
SpecReTF improves time series forecasting by retrieving similar historical patterns using windowed frequency representations with combined amplitude-phase similarity and exponential recency weighting, outperforming time-domain methods on benchmarks.
sGPO: Trading Inference FLOPs for Training Efficiency in RLVR cs.LG · 2026-06-07 · unverdicted · none · ref 203
sGPO uses an initial-policy success-rate profiling pass to adaptively set rollout group sizes, filter data, and build a curriculum, cutting total RLVR training compute by 3x while matching baseline performance.
Contrastive Learning and Correlation Clustering for Sequences of Network Telescope Data cs.LG · 2026-06-03 · unverdicted · none · ref 20
A contrastive learning transformer embeds network flow sequences to enable correlation clustering that groups scanner sources consistently with labels.
How Many Different Outputs Can a Transformer Generate? cs.LG · 2026-05-21 · unverdicted · none · ref 29
Transformers are limited to a linearly growing number of accessible output sequences with prompt length, with exponential decay in accessible proportion beyond a critical point, even under unbounded context.
Structure-Aware Masking for Protein Representation Learning cs.LG · 2026-05-15 · conditional · none · ref 2
Bucket Masking improves protein fitness prediction by up to 14% over random masking by preferentially masking structurally coupled residue groups on four downstream tasks.
Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces cs.LG · 2026-05-07 · unverdicted · none · ref 9
Chemically meaningful steering for properties like cLogP and TPSA emerges in entangled Transformer-VAE latent spaces only after controlling for SELFIES representation confounds through residualization and decoded traversals.
A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks cs.LG · 2026-05-06 · unverdicted · none · ref 24
A dual-purpose benchmark supplies two text-derived knowledge graphs and one expert reference graph on the same biomedical corpus to jointly measure construction method quality and GNN robustness via semi-supervised node classification.
Skipping the Zeros in Diffusion Models for Sparse Data Generation cs.LG · 2026-05-03 · unverdicted · none · ref 1 · 2 links
SED modifies diffusion models to skip zeros during training and inference, preserving sparsity patterns while claiming computational savings and comparable or better generation quality on benchmarks.
Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces cs.LG · 2026-05-01 · unverdicted · none · ref 6 · 2 links
KAHM yields a compute-efficient query encoder that outperforms matched learned adapters in reconstructing a frozen Mixedbread embedding space on an Austrian-law retrieval task while delivering an 8.53x CPU speedup.
PrismAgent: Illuminating Harm in Memes via a Zero-Shot Interpretable Multi-Agent Framework cs.LG · 2026-05-01 · unverdicted · none · ref 15
PrismAgent deploys four specialized LLM agents in sequence to analyze meme intent, gather context, make preliminary judgments, and deliver a final harm verdict, outperforming prior zero-shot methods on three public datasets.
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees cs.LG · 2026-05-01 · unverdicted · none · ref 76 · 2 links
InvEvolve evolves inventory policies using LLMs with RL and provides statistical safety guarantees, outperforming classical and DL methods on synthetic and real data.
Polaris: Coupled Orbital Polar Embeddings for Hierarchical Concept Learning cs.LG · 2026-04-30 · unverdicted · none · ref 2
Polaris separates semantic meaning from hierarchical structure in embeddings via angular geometry and radius on a hypersphere, yielding up to 19-point gains in taxonomy expansion retrieval over baselines.
Towards E-Value Based Stopping Rules for Bayesian Deep Ensembles cs.LG · 2026-04-20 · unverdicted · none · ref 253
E-value sequential tests enable early stopping of MCMC sampling in Bayesian deep ensembles, often needing only a fraction of the full budget while improving over standard deep ensembles.
Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning cs.LG · 2026-04-15 · unverdicted · none · ref 8
ESC-RL improves RL for radiology reports via group-wise evidence-aware rewards (GEAR) and LLM-driven self-correcting preference learning (SPL), reaching state-of-the-art on two chest X-ray datasets.
One Model for All: Multi-Objective Controllable Language Models cs.LG · 2026-04-06 · unverdicted · none · ref 5
Multi-Objective Control trains a single LLM as a preference-conditioned policy using multi-objective optimization in RLHF to produce outputs in user-specified regions of the Pareto front.
Fundamental Limitations of Favorable Privacy-Utility Guarantees for DP-SGD cs.LG · 2026-01-15 · unverdicted · none · ref 20
Shuffled DP-SGD requires σ ≥ 1/√(2 ln M) or κ ≥ (1/√8)(1 - 1/√(4π ln M)) to limit adversarial advantage, preventing strong privacy and high utility simultaneously.
Fusion or Confusion? Multimodal Complexity Is Not All You Need cs.LG · 2025-12-28 · unverdicted · none · ref 11
Complex multimodal architectures do not reliably outperform unimodal baselines or a simple multimodal baseline under standardized evaluation.
Curriculum-guided multimodal representation learning enables generalizable prediction of nanomaterial-protein interactions cs.LG · 2025-07-18 · conditional · none · ref 18
CuMMI applies curriculum learning across progressively complex biofluids to a multimodal model integrating protein sequence, structure, and 37 experimental features, achieving mean classification metrics above 0.75 on temporal, nanomaterial-held-out, and protein-held-out tests.
Rotary Masked Autoencoders are Versatile Learners cs.LG · 2025-05-26 · unverdicted · none · ref 15
RoMAE applies rotary positional embeddings to masked autoencoders to enable representation learning and interpolation on continuous positional data across irregular time-series, images, and audio without modality-specific modifications.
BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 29
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
Mixture-of-Control: State-Aware Fine-Tuning for Transformer-based Models cs.LG · 2026-06-30 · unverdicted · none · ref 20
Mixture-of-Control adaptively combines local and global control states in transformer fine-tuning by treating per-block states as experts in a sparse MoE setup to improve cross-block communication while keeping memory and compute costs comparable to prior state-based methods.
Persona-Trained Monte Carlo: Estimating Market-Outcome Distributions via Swarms of Persona-Conditioned Neural Policy Bots in a Limit Order Book cs.LG · 2026-06-28 · unverdicted · none · ref 51
PTMC is a proposed Monte Carlo estimator that generates market-outcome distributions by simulating continuous double-auction interactions among persona-conditioned neural-policy bots whose heterogeneity is drawn from a learned distribution.
Objective-Behavior Alignment: Diagnostics for MORL Policy Selection cs.LG · 2026-06-19 · unverdicted · none · ref 83
Proposes an exploratory diagnostic workflow to highlight behavioral variation along MORL Pareto fronts not captured by objective values, with validation on grid and continuous control tasks.
Conservation Laws for Modern Neural Architectures cs.LG · 2026-06-16 · unverdicted · none · ref 13
Unified framework characterizes conservation laws for gradient flow in feedforward networks with GELU/SiLU/SwiGLU, multihead attention with positional encodings, and MoE models under various gating.
Hide to Guide: Learning via Semantic Masking cs.LG · 2026-05-24 · unverdicted · none · ref 41
SMEPO applies fine-grained semantic masking to expert guidance in RLVR, turning hard problems into fill-in-the-blank tasks while preserving structure, yielding up to 3.2 point accuracy gains and 4.2x faster training.
Toto 2.0: Time Series Forecasting Enters the Scaling Era cs.LG · 2026-05-19 · unverdicted · none · ref 15 · 2 links
Time series foundation models scale under a single training recipe, with forecast quality improving from 4M to 2.5B parameters and new SOTA results on BOOM, GIFT-Eval, and TIME benchmarks.
To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios cs.LG · 2026-05-15 · unverdicted · none · ref 31
Truncated embeddings from non-MRL models perform comparably to or better than MRL-trained models for most truncation levels, except heavy truncation of 80% or more.
From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation cs.LG · 2026-05-15 · unverdicted · none · ref 5
Sparsity-guided distillation enables replacing attention layers in ViTs with simpler sequential modules, with sparser layers showing smaller performance drops.
The Efficiency Gap in Byte Modeling cs.LG · 2026-05-13 · unverdicted · none · ref 51
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
LLM4Delay: Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation cs.LG · 2025-10-24 · unverdicted · none · ref 29
LLM4Delay improves flight delay prediction accuracy by using instance-level projection to adapt LLMs for integrating textual aeronautical information with multiple aircraft trajectories.
Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment cs.LG · 2026-05-31 · unverdicted · none · ref 170
The paper calls for life cycle assessment to capture embodied hardware costs and full pipeline operational costs in AI development and deployment.

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer