A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
An open-source data contamination report for large language models
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations in novelty prediction.
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
citing papers explorer
-
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
-
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations in novelty prediction.
-
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.