Mixed citations

Proceedings of the 2018

Wang, Alex, Singh, Amanpreet, Michael, Julian, Hill, Felix, Levy, Omer, Bowman, Samuel , year = · 2018 · DOI 10.18653/v1/w18-5446

Mixed citation behavior. Most common role is background (67%).

31 Pith papers citing it

Background 67% of classified citations

open at publisher browse 31 citing papers

citation-role summary

background 4 dataset 2

citation-polarity summary

background 4 use dataset 2

representative citing papers

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

cs.CL · 2026-05-19 · accept · novelty 7.0

A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.

EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.

Evolutionary Negative Module Pruning for Better LoRA Merging

cs.AI · 2026-04-20 · conditional · novelty 7.0

ENMP prunes negative LoRA modules via evolutionary search to boost merging performance to new state-of-the-art levels across language and vision tasks.

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

cs.SE · 2026-01-25 · conditional · novelty 7.0

Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer

cs.CL · 2024-08-02 · unverdicted · novelty 7.0

Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of initialization seed on two model architectures.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

cs.CL · 2020-05-22 · accept · novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

cs.CL · 2019-09-26 · accept · novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

GiLT: Augmenting Transformer Language Models with Dependency Graphs

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

GiLT augments Transformers with semantic dependency graphs by modulating attention to improve syntactic generalization while keeping perplexity competitive and enabling better finetuning on downstream tasks.

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.

MC$^2$: Monte Carlo Correction for Fast Elliptic PDE Solving

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

MC² corrects low-budget Monte Carlo solutions for elliptic PDEs with a single-pass neural network to match the accuracy of 1000× more Monte Carlo samples while outperforming classical and learned baselines.

Extreme Weather Bench: A framework and benchmark for evaluation of high-impact weather

cs.LG · 2026-05-01 · accept · novelty 6.0

Extreme Weather Bench supplies standardized case studies, observational data, impact metrics, and code to evaluate weather models on high-impact hazards.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Parameter-efficient Quantum Multi-task Learning

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.

Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

cs.LG · 2025-10-12 · unverdicted · novelty 6.0

Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that are competitive in experiments.

HyperAdapt: Simple High-Rank Adaptation

cs.LG · 2025-09-23 · unverdicted · novelty 6.0

HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.

Should We Still Pretrain Encoders with Masked Language Modeling?

cs.CL · 2025-07-01 · accept · novelty 6.0

Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.

How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

cs.CL · 2024-11-08 · unverdicted · novelty 6.0

The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

cs.CL · 2023-04-13 · accept · novelty 6.0

AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.

Survey-aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review

stat.ML · 2026-05-09 · unverdicted · novelty 5.0

The authors introduce Survey-aware Machine Learning (SaML) as a nine-step guideline that integrates survey design metadata throughout the ML lifecycle to enable valid population inference from complex health surveys.

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 5.0

Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

Model-Agnostic Meta Learning for Class Imbalance Adaptation

cs.CL · 2026-04-20 · conditional · novelty 5.0

HAMR combines meta-learning with hardness-aware weighting and neighborhood resampling to improve minority-class performance on imbalanced NLP datasets.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

cs.CL · 2024-12-18 · unverdicted · novelty 5.0

ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.

citing papers explorer

Showing 31 of 31 citing papers.

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework cs.CL · 2026-05-19 · accept · none · ref 38
A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.
EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints cs.CL · 2026-05-09 · unverdicted · none · ref 25
EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.
Evolutionary Negative Module Pruning for Better LoRA Merging cs.AI · 2026-04-20 · conditional · none · ref 33
ENMP prunes negative LoRA modules via evolutionary search to boost merging performance to new state-of-the-art levels across language and vision tasks.
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild cs.SE · 2026-01-25 · conditional · none · ref 63
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer cs.CL · 2024-08-02 · unverdicted · none · ref 46
Task prompt vectors, formed by subtracting random initialization from tuned soft prompts, support low-resource initialization and arithmetic combination across tasks on 12 NLU datasets while remaining independent of initialization seed on two model architectures.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks cs.CL · 2020-05-22 · accept · none · ref 64
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations cs.CL · 2019-09-26 · accept · none · ref 36
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
GiLT: Augmenting Transformer Language Models with Dependency Graphs cs.CL · 2026-05-15 · unverdicted · none · ref 32
GiLT augments Transformers with semantic dependency graphs by modulating attention to improve syntactic generalization while keeping perplexity competitive and enabling better finetuning on downstream tasks.
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks cs.LG · 2026-05-11 · unverdicted · none · ref 38
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
MC$^2$: Monte Carlo Correction for Fast Elliptic PDE Solving cs.LG · 2026-05-10 · unverdicted · none · ref 35
MC² corrects low-budget Monte Carlo solutions for elliptic PDEs with a single-pass neural network to match the accuracy of 1000× more Monte Carlo samples while outperforming classical and learned baselines.
Extreme Weather Bench: A framework and benchmark for evaluation of high-impact weather cs.LG · 2026-05-01 · accept · none · ref 29
Extreme Weather Bench supplies standardized case studies, observational data, impact metrics, and code to evaluate weather models on high-impact hazards.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 122
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
Parameter-efficient Quantum Multi-task Learning cs.LG · 2026-04-15 · unverdicted · none · ref 39
QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.
Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods cs.LG · 2025-10-12 · unverdicted · none · ref 21
Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that are competitive in experiments.
HyperAdapt: Simple High-Rank Adaptation cs.LG · 2025-09-23 · unverdicted · none · ref 37
HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.
Should We Still Pretrain Encoders with Masked Language Modeling? cs.CL · 2025-07-01 · accept · none · ref 40
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.
How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP cs.CL · 2024-11-08 · unverdicted · none · ref 61
The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models cs.CL · 2023-04-13 · accept · none · ref 74
AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.
Survey-aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review stat.ML · 2026-05-09 · unverdicted · none · ref 34
The authors introduce Survey-aware Machine Learning (SaML) as a nine-step guideline that integrates survey design metadata throughout the ML lifecycle to enable valid population inference from complex health surveys.
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 39 · 2 links
Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 80
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
Model-Agnostic Meta Learning for Class Imbalance Adaptation cs.CL · 2026-04-20 · conditional · none · ref 57
HAMR combines meta-learning with hardness-aware weighting and neighborhood resampling to improve minority-class performance on imbalanced NLP datasets.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 194
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility cs.LG · 2026-05-07 · unverdicted · none · ref 203 · 2 links
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection cs.CL · 2026-04-25 · unverdicted · none · ref 24
Four new Reddit-derived datasets for mental health detection tasks are presented with inter-annotator agreement above 0.8 and reported model F1 scores of 93-99%.
Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models cs.CL · 2025-09-24 · unverdicted · none · ref 60
Fine-tuning on annotated English and Japanese dialogues improves clustering of backchannels and fillers and makes generated utterances closer to human ones.
Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked? cs.CL · 2025-07-21 · unverdicted · none · ref 34
LLM accuracy on reasoning tasks differs significantly by question type, with step-by-step reasoning accuracy often uncorrelated to final answer selection.
Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models cs.LG · 2025-08-06 · unverdicted · none · ref 8
A systematic literature review of explainability in multimodal attention models finds most studies focus on vision-language tasks with attention-based explanations, but evaluation methods lack consistency and modality-specific considerations.
How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework cs.CL · 2026-05-22 · unreviewed · ref 95
Position: State-of-the-Art Claims Require State-of-the-Art Evidence cs.LG · 2026-05-17 · unreviewed · ref 6
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining cs.CL · 2026-04-27 · unreviewed · ref 41

Proceedings of the 2018

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer