and Eisenschlos, Julian Martin and Gillick, Daniel and Eisenstein, Jacob and Cohen, William W

Cole, J · 2022 · DOI 10.1162/tacl_a_00459

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models

cs.SE · 2026-06-24 · unverdicted · novelty 7.0

The LibEvoBench benchmark shows that LLMs are version-oblivious on evolving APIs: supplying documentation helps, but specifying the version does not.

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.

Evaluating Temporal Consistency in Multi-Turn Language Models

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.

Norm Anchors Make Model Edits Last

cs.LG · 2026-01-30 · conditional · novelty 7.0

Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.

TIME: Temporally Intelligent Meta-reasoning Engine for Context-Triggered Explicit Reasoning

cs.LG · 2026-01-08 · unverdicted · novelty 7.0

TIME trains LLMs to invoke compact, context-triggered reasoning via time tags and tick events, improving TIMEBench scores while cutting explicit reasoning tokens by an order of magnitude.

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

cs.CL · 2024-04-29 · conditional · novelty 7.0

A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

cs.CL · 2024-01-27 · accept · novelty 7.0

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.

Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

Decisive combines document-grounded option scoring with adaptive Bayesian preference elicitation to achieve up to 20% higher decision accuracy than LLMs and existing frameworks across domains.

EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge

cs.CL · 2025-07-04 · accept · novelty 6.0

EMERGE is a benchmark dataset of 233K Wikipedia passages paired with 1.45 million Wikidata edit operations across seven yearly snapshots from 2019 to 2025 for evaluating knowledge graph updates from emerging text.

Atlas: Few-shot Learning with Retrieval Augmented Language Models

cs.CL · 2022-08-05 · unverdicted · novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

A Study of Temporal Fusion Strategies for Named Entity Recognition in Historical Texts

cs.CL · 2026-06-26 · unverdicted · novelty 5.0

Late fusion of absolute and relative temporal metadata into Transformer NER models produces more robust performance than early fusion on French and German historical datasets, especially in early noisy periods.

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

cs.CL · 2026-06-01 · unverdicted · novelty 5.0

ProbeScale finds layer subsets in SLMs like RoBERTa-Large and T5-Base that cut parameters 5-10x while retaining 95-98% of original task performance by maximizing aggregated probe scores under a budget.

Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation

cs.CL · 2026-04-23 · unverdicted · novelty 5.0

KARITA integrates knowledge-driven augmentation and retrieval to improve classification performance under temporal shifts across clinical, legal, and scientific domains.

Forecasting With LLMs: Improved Generalization Through Feature Steering

cs.CL · 2026-06-25 · unverdicted · novelty 4.0

Amplifying time-awareness features in LLMs via sparse autoencoders reduces look-ahead bias in forecasting while preserving general performance.

Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification

cs.CL · 2026-06-25 · unverdicted · novelty 4.0

LLM-extracted patterns merging logical structures and linguistic cues yield statistically significant gains in fallacy classification over zero-shot baselines with cross-dataset generalization.

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

cs.CL · 2026-05-18 · unverdicted · novelty 3.0

MMoA adds LSTM recurrence to Mixture-of-Agents routing, reaching 58.0% win rate on AlpacaEval 2.0 versus 59.8% for baseline MoA while cutting runtime by up to 4.6%.

The Dynamic Gist-Based Memory Model (DGMM): A Memory-Centric Architecture for Artificial Intelligence

cs.AI · 2026-05-04 · unverdicted · novelty 3.0

DGMM is proposed as an explicit graph-structured memory architecture for AI that enables persistent episodic memory, cue-based recall, and context-dependent interpretation without retraining.

Optimizing Abstractive Summarization With Fine-Tuned PEGASUS

cs.CL · 2026-06-24 · unverdicted · novelty 2.0

Fine-tuned PEGASUS achieves state-of-the-art ROUGE scores on the XL-Sum English corpus, with gains of 4.04% ROUGE-1, 15.25% ROUGE-2, and 3.39% ROUGE-L over the mT5 baseline.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

cs.CL · 2024-04-29 · conditional · none · ref 296

A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

cs.CL · 2024-01-27 · accept · none · ref 295

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
