LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
Canonical reference
Ellie Pavlick and Tom Kwiatkowski
Canonical reference. 85% of citing Pith papers cite this work as background.
representative citing papers
SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.
Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.
LASQ is a new quadruple extraction dataset for Uzbek and Uyghur, accompanied by a syntax-aware model that shows gains over baselines on the task.
Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
Develops an information-theoretic framework showing that surprise and coherence trade off in single-reader models but coexist via pre- and post-revelation modes, operationalized as reference-less LLM metrics for fair play and validated on generated stories plus classic detective fiction.
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applying modified rejection sampling to keep the exact target distribution.
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
The paper justifies the composite coherence metric in event-based narrative extraction via an information-geometric decomposition on the product manifold and an axiomatic uniqueness proof for the geometric mean.
TAVR-VLM introduces Risk-Conditioned Causal Grounding Attention to achieve SOTA AUROC 0.896, CIDEr 0.936, and 8.1% hallucination rate on a 1,482-patient TAVR cohort.
STAR rethinks MoE routing as structure-aware subspace learning by adding a GHA-tracked principal subspace to standard routers, yielding more stable specialization and better performance on synthetic, language, and vision tasks.
Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.
KGEMs for link prediction exhibit high instability in predictions and embeddings arising from initialization, negative sampling, and other factors, with better MRR not ensuring higher stability.
DIVE improves in-context vector distillation for medical report generation via decisive-token supervision on pathology terms and EOS plus state-conditioned dynamic steering, achieving top BLEU-4, ROUGE-L and RadGraph F1 on MIMIC-CXR and CheXpert Plus.
CLIF applies influence functions to pinpoint influential samples and concepts in CBMs on CEBaB and Yelp datasets, enabling performance restoration via adjustments without retraining.
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
QD-LLM applies neuroevolution to prompt embeddings within a quality-diversity framework, producing 46% higher coverage and 41% higher QD-score than QDAIF on HumanEval, MBPP, and creative writing benchmarks.
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
A large-scale audit of 21 LLMs on OR-Bench, XSTest, ToxiGen and BOLD using composition adjustment reveals distinct conservative vs permissive safety strategies, unequal demographic protection, and post-training stability within model families.
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
CIR is a cross-platform container image format for Python/R-style apps that defers dependency assembly to deployment, cutting image size by 95% and deployment time by 40-60% versus traditional bundled images.
A metadata-conditioned mT5 model trained on rule-augmented dialectal Arabic data produces translations that better match intended regional varieties than high-resource baselines, despite lower BLEU scores.
JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
citing papers explorer
- A Reproducible Benchmark and Evidence-Retrieval Software Framework for Silicon Detector R&D Literature
- CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
- Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs
- Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models