The LibEvoBench benchmark shows that LLMs are version-oblivious on evolving APIs: providing documentation helps, but specifying the version does not.
and Eisenschlos, Julian Martin and Gillick, Daniel and Eisenstein, Jacob and Cohen, William W
18 Pith papers cite this work. Polarity classification is still indexing.
roles: background 1
polarities: background 1
representative citing papers
Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.
TIME trains LLMs to trigger compact, context-triggered reasoning via time tags and tick events, improving TIMEBench scores while cutting explicit reasoning tokens by an order of magnitude.
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
Decisive combines document-grounded option scoring with adaptive Bayesian preference elicitation to achieve up to 20% higher decision accuracy than LLMs and existing frameworks across domains.
EMERGE is a benchmark dataset of 233K Wikipedia passages paired with 1.45 million Wikidata edit operations across seven yearly snapshots from 2019 to 2025 for evaluating knowledge graph updates from emerging text.
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
Late fusion of absolute and relative temporal metadata into Transformer NER models produces more robust performance than early fusion on French and German historical datasets, especially in early noisy periods.
ProbScale finds layer subsets in SLMs like RoBERTa-Large and T5-Base that cut parameters 5-10x while retaining 95-98% of original task performance by maximizing aggregated probe scores under a budget.
KARITA integrates knowledge-driven augmentation and retrieval to improve classification performance under temporal shifts across clinical, legal, and scientific domains.
Amplifying time-awareness features in LLMs via sparse autoencoders reduces look-ahead bias in forecasting while preserving general performance.
LLM-extracted patterns merging logical structures and linguistic cues yield statistically significant gains in fallacy classification over zero-shot baselines with cross-dataset generalization.
MMoA adds LSTM recurrence to Mixture-of-Agents routing, reaching 58.0% win rate on AlpacaEval 2.0 versus 59.8% for baseline MoA while cutting runtime by up to 4.6%.
DGMM is proposed as an explicit graph-structured memory architecture for AI that enables persistent episodic memory, cue-based recall, and context-dependent interpretation without retraining.
Fine-tuned PEGASUS achieves state-of-the-art ROUGE scores on the XL-Sum English corpus, with gains of 4.04% ROUGE-1, 15.25% ROUGE-2, and 3.39% ROUGE-L over the mT5 baseline.
citing papers explorer
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
  A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
- MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
  MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.