hub Canonical reference

Ellie Pavlick and Tom Kwiatkowski

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou · 2020 · DOI 10.18653/v1/2020

Canonical reference. 85% of citing Pith papers cite this work as background.

49 Pith papers citing it

citation-role summary

background 12 · method 1

citation-polarity summary

background 11 · extend 1 · unclear 1

representative citing papers

Pretraining Exposure Explains Popularity Judgments in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.

Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

cs.CV · 2026-04-30 · unverdicted · novelty 7.0

Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.

LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

LASQ is a new quadruple extraction dataset for Uzbek and Uyghur that includes a syntax-aware model showing gains over baselines on the task.

Scaling Laws for Cross-Encoder Reranking

cs.IR · 2026-03-05 · unverdicted · novelty 7.0

Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.

The Challenge and Reward of Fair Play in Narrative: A Computational Approach

cs.CL · 2025-07-18 · unverdicted · novelty 7.0

Develops an information-theoretic framework showing surprise and coherence trade off in single reader models but coexist via pre- and post-revelation modes, operationalized as reference-less LLM metrics for fair play and validated on generated stories plus classic detective fiction.

Accelerating Large Language Model Decoding with Speculative Sampling

cs.CL · 2023-02-02 · accept · novelty 7.0

Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

Multitask Prompted Training Enables Zero-Shot Task Generalization

cs.LG · 2021-10-15 · conditional · novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

An Information-Geometric Justification for Composite Coherence in Event-Based Narrative Extraction

cs.IT · 2026-06-28 · unverdicted · novelty 6.0

The paper justifies the composite coherence metric in event-based narrative extraction via an information-geometric decomposition on the product manifold and an axiomatic uniqueness proof for the geometric mean.

TAVR-VLM: Risk-Conditioned Causal Grounding for Hallucination-Resistant Report Generation

cs.AI · 2026-06-25 · unverdicted · novelty 6.0

TAVR-VLM introduces Risk-Conditioned Causal Grounding Attention to achieve SOTA AUROC 0.896, CIDEr 0.936, and 8.1% hallucination rate on a 1,482-patient TAVR cohort.

STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

cs.AI · 2026-06-07 · unverdicted · novelty 6.0

STAR rethinks MoE routing as structure-aware subspace learning by adding a GHA-tracked principal subspace to standard routers, yielding more stable specialization and better performance on synthetic, language, and vision tasks.

Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.

Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

KGEMs for link prediction exhibit high instability in predictions and embeddings from initialization, negative sampling, and other factors, with better MRR not ensuring higher stability.

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

DIVE improves in-context vector distillation for medical report generation via decisive-token supervision on pathology terms and EOS plus state-conditioned dynamic steering, achieving top BLEU-4, ROUGE-L and RadGraph F1 on MIMIC-CXR and CheXpert Plus.

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

CLIF applies influence functions to pinpoint influential samples and concepts in CBMs on CEBaB and Yelp datasets, enabling performance restoration via adjustments without retraining.

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.

Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution

cs.NE · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

QD-LLM applies neuroevolution to prompt embeddings within a quality-diversity framework, producing 46% higher coverage and 41% higher QD-score than QDAIF on HumanEval, MBPP, and creative writing benchmarks.

Retrieval from Within: An Intrinsic Capability of Attention-Based Models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

The Refusal-Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

A large-scale audit of 21 LLMs on OR-Bench, XSTest, ToxiGen and BOLD using composition adjustment reveals distinct conservative vs permissive safety strategies, unequal demographic protection, and post-training stability within model families.

When AI reviews science: Can we trust the referee?

cs.AI · 2026-04-26 · unverdicted · novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.

CIR: Lightweight Container Image for Cross-Platform Deployment

cs.DC · 2026-04-12 · unverdicted · novelty 6.0

CIR is a cross-platform container image format for Python/R-style apps that defers dependency assembly to deployment, cutting image size by 95% and deployment time by 40-60% versus traditional bundled images.

Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection

cs.CL · 2026-04-07 · unverdicted · novelty 6.0

A metadata-conditioned mT5 model trained on rule-augmented dialectal Arabic data produces translations that better match intended regional varieties than high-resource baselines, despite lower BLEU scores.

JUÁ - A Benchmark for Information Retrieval in Brazilian Legal Text Collections

cs.IR · 2026-04-07 · accept · novelty 6.0

JUÁ is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.

TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.

citing papers explorer

Showing 41 of 41 citing papers after filters.

Pretraining Exposure Explains Popularity Judgments in Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 29
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing cs.AI · 2026-06-29 · unverdicted · none · ref 13
SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.
Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention cs.CV · 2026-04-30 · unverdicted · none · ref 44
Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.
LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset cs.CL · 2026-04-12 · unverdicted · none · ref 56
LASQ is a new quadruple extraction dataset for Uzbek and Uyghur that includes a syntax-aware model showing gains over baselines on the task.
Scaling Laws for Cross-Encoder Reranking cs.IR · 2026-03-05 · unverdicted · none · ref 28
Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
The Challenge and Reward of Fair Play in Narrative: A Computational Approach cs.CL · 2025-07-18 · unverdicted · none · ref 25
Develops an information-theoretic framework showing surprise and coherence trade off in single reader models but coexist via pre- and post-revelation modes, operationalized as reference-less LLM metrics for fair play and validated on generated stories plus classic detective fiction.
An Information-Geometric Justification for Composite Coherence in Event-Based Narrative Extraction cs.IT · 2026-06-28 · unverdicted · none · ref 51
The paper justifies the composite coherence metric in event-based narrative extraction via an information-geometric decomposition on the product manifold and an axiomatic uniqueness proof for the geometric mean.
TAVR-VLM: Risk-Conditioned Causal Grounding for Hallucination-Resistant Report Generation cs.AI · 2026-06-25 · unverdicted · none · ref 3
TAVR-VLM introduces Risk-Conditioned Causal Grounding Attention to achieve SOTA AUROC 0.896, CIDEr 0.936, and 8.1% hallucination rate on a 1,482-patient TAVR cohort.
STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning cs.AI · 2026-06-07 · unverdicted · none · ref 30
STAR rethinks MoE routing as structure-aware subspace learning by adding a GHA-tracked principal subspace to standard routers, yielding more stable specialization and better performance on synthetic, language, and vision tasks.
Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions cs.CL · 2026-06-04 · unverdicted · none · ref 10
Introduces a matched four-condition protocol and ONCU metric to diagnose evidence utilization in long-context and RAG models across synthetic and multi-hop QA tasks.
Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings cs.LG · 2026-06-02 · unverdicted · none · ref 29
KGEMs for link prediction exhibit high instability in predictions and embeddings from initialization, negative sampling, and other factors, with better MRR not ensuring higher stability.
Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation cs.CL · 2026-05-26 · unverdicted · none · ref 5
DIVE improves in-context vector distillation for medical report generation via decisive-token supervision on pathology terms and EOS plus state-conditioned dynamic steering, achieving top BLEU-4, ROUGE-L and RadGraph F1 on MIMIC-CXR and CheXpert Plus.
CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models cs.CL · 2026-05-19 · unverdicted · none · ref 7
CLIF applies influence functions to pinpoint influential samples and concepts in CBMs on CEBaB and Yelp datasets, enabling performance restoration via adjustments without retraining.
Task-Adaptive Embedding Refinement via Test-time LLM Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 6
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution cs.NE · 2026-05-10 · unverdicted · none · ref 15 · 2 links
QD-LLM applies neuroevolution to prompt embeddings within a quality-diversity framework, producing 46% higher coverage and 41% higher QD-score than QDAIF on HumanEval, MBPP, and creative writing benchmarks.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models cs.LG · 2026-05-07 · unverdicted · none · ref 16 · 2 links
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
The Refusal-Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models cs.AI · 2026-05-06 · unverdicted · none · ref 10
A large-scale audit of 21 LLMs on OR-Bench, XSTest, ToxiGen and BOLD using composition adjustment reveals distinct conservative vs permissive safety strategies, unequal demographic protection, and post-training stability within model families.
When AI reviews science: Can we trust the referee? cs.AI · 2026-04-26 · unverdicted · none · ref 31
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
CIR: Lightweight Container Image for Cross-Platform Deployment cs.DC · 2026-04-12 · unverdicted · none · ref 59
CIR is a cross-platform container image format for Python/R-style apps that defers dependency assembly to deployment, cutting image size by 95% and deployment time by 40-60% versus traditional bundled images.
Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection cs.CL · 2026-04-07 · unverdicted · none · ref 9
A metadata-conditioned mT5 model trained on rule-augmented dialectal Arabic data produces translations that better match intended regional varieties than high-resource baselines, despite lower BLEU scores.
TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models cs.CL · 2026-04-03 · unverdicted · none · ref 11
TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass cs.CL · 2026-02-06 · unverdicted · none · ref 8
SHINE trains a scalable in-context hypernetwork to generate high-quality LoRA adapters from contexts in one pass, enabling efficient LLM adaptation that saves time and compute compared to standard fine-tuning.
Deep sequence models tend to memorize geometrically; it is unclear why cs.LG · 2025-10-30 · unverdicted · none · ref 150
Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
UNICS: Multilingual Code Search via Unified Pseudocode and Contrastive Transfer Learning cs.SE · 2026-06-26 · unverdicted · none · ref 29
UNICS pre-trains on a pseudocode dataset for cross-lingual logic then applies multi-task transfer learning with hard-positive mining and dynamic hard-negative sampling to reach claimed SOTA on multilingual code-search benchmarks.
An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination cs.CL · 2026-06-10 · unverdicted · none · ref 3
OMAGR decomposes queries into ontology-aligned anchors for parallel multi-dimensional graph retrieval, outperforming baselines on Context Precision and Faithfulness in the new TrafficLaw-QA dataset of 200 questions.
It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO cs.CL · 2026-06-09 · unverdicted · none · ref 17
One-shot GRPO on a single biased example induces generalizing stereotype bias in post-trained LLMs, with susceptibility varying by initial bias likelihood.
CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures cs.CL · 2026-06-04 · unverdicted · none · ref 25
CAF-Gen uses an iterative multi-agent creator-reviewer process to enrich shallow argument mining outputs into structurally richer CAF-compliant models with claimed improvements over single-pass generation.
From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation cs.AI · 2026-05-26 · unverdicted · none · ref 9
N2I-RAG is an agentic RAG pipeline that automates binary legal indicator computation from complex normative texts with explicit traceability to provisions.
TextClusterLab: An Integrated Framework for Reliable Text Clustering Studies cs.IR · 2026-05-17 · unverdicted · none · ref 6
TextClusterLab introduces an LLM-driven generator for synthetic text clustering datasets with tunable attributes and a suitability benchmark for evaluation.
A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation cs.SD · 2026-05-08 · unverdicted · none · ref 6
The RER framework decomposes chord generation into retrieval, editing, and reranking stages to outperform end-to-end models in balancing stylistic diversity with music-theoretic feasibility.
pAI/MSc: ML Theory Research with Humans on the Loop cs.AI · 2026-04-22 · unverdicted · none · ref 76
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs cs.CL · 2026-04-02 · unverdicted · none · ref 26
LLMs produce overly positive idealized depictions of disability in simulated social media posts that do not match real posts by people with disabilities and show topic bias favoring nondisabled people.
Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications cs.SE · 2026-03-13 · unverdicted · none · ref 20
An automated self-testing framework with evidence-based quality gates for LLM application releases was evaluated in a longitudinal case study of a multi-agent conversational AI system, identifying rollback builds and supporting stable quality over four weeks.
Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents cs.CL · 2026-02-18 · unverdicted · none · ref 6
Calibrate-Then-Act supplies LLM agents with priors on latent environment states to enable explicit cost-uncertainty reasoning, producing more optimal strategies than standard approaches in retrieval QA and file-reading coding tasks.
Always Tell Me The Odds: Fine-grained Conditional Probability Estimation cs.CL · 2025-05-02 · unverdicted · none · ref 5
New LLM-based models for fine-grained conditional probability estimation outperform prior fine-tuned and prompting methods through enhanced data creation and supervision.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 157
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
On-Device Robotic Planning: Eliminating Inference Redundancy for Efficient Decision-Making cs.RO · 2026-05-29 · unverdicted · none · ref 63
REIS reduces inference redundancy in embodied robotic planning via lightweight gating and routing while preserving task performance on ALFRED and real robots.
Translation Analytics for Freelancers II: Benchmarking Local LLMs for Confidential Translation Workflows cs.CL · 2026-05-29 · unverdicted · none · ref 4
Local LLMs via Ollama match or exceed some local NMT systems and a frontier LLM on a new multilingual corpus but lag behind top commercial NMTs like DeepL.
Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models cs.SE · 2026-04-28 · unverdicted · none · ref 13
CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP cs.LG · 2026-04-01 · unverdicted · none · ref 1
Matched learning-rate experiments show LoRA retains substantially higher zero-shot transfer (45% vs 11% on EuroSAT, 58% vs 9% on Pets) than Full FT in CLIP adaptation.
Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting cs.CV · 2026-06-04 · unverdicted · none · ref 37
A late-fusion gradient-boosting pipeline with LLM semantic features is submitted to the EXIST 2026 lab for sexism identification in memes and videos, showing mixed generalization from development to test data.
