MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
5 Pith papers cite this work.
Years: 2026 (5)
Verdicts: UNVERDICTED (5)
Citing papers
- SFBench: The SciFy Scientific Feasibility Benchmark
  SFBench provides 197 expert-created materials science claims with feasibility scores and explanations to evaluate AI systems on scientific feasibility assessment.
- CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs
  CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.
- CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
  CulturALL is a new benchmark where even the best LLMs reach only 44.48% accuracy on grounded multilingual and multicultural tasks.
- CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations
  CroCo applies English-reward-ranked self-generations for contrastive preference tuning that improves two LLMs on structured and open-ended tasks across 14 languages without language-specific annotations.
- The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes
  A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.