MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Alexander R Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing · 2025 · arXiv 2507.17476

5 Pith papers cite this work.



representative citing papers

SFBench: The SciFy Scientific Feasibility Benchmark

cs.AI · 2026-06-28 · unverdicted · novelty 7.0

SFBench provides 197 expert-created materials science claims with feasibility scores and explanations to evaluate AI systems on scientific feasibility assessment.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

CulturALL is a new benchmark where even the best LLMs reach only 44.48% accuracy on grounded multilingual and multicultural tasks.

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

CroCo applies English-reward-ranked self-generations for contrastive preference tuning that improves two LLMs on structured and open-ended tasks across 14 languages without language-specific annotations.

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

cs.CL · 2026-06-09 · unverdicted · novelty 4.0

A literature survey that introduces a taxonomy for LLM reasoning paradigms, analyzes methodological trends, and synthesizes failure modes from over 300 papers.

