arXiv preprint arXiv:2305.17306 , year=

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance , author= · 2023 · arXiv 2305.17306

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

cs.CL · 2023-09-15 · unverdicted · novelty 7.0

EvoPrompt uses LLMs to run evolutionary operators on populations of prompts, outperforming human-engineered prompts by up to 25% on BIG-Bench Hard tasks across 31 datasets.

Teachers' Perceived Benefits and Risks of AI Across Fifty-Five Countries: An Audit of LLM Alignment and Steerability

cs.CY · 2026-05-08 · unverdicted · novelty 6.0

Teachers' views on AI benefits and risks vary widely across 55 countries, but LLMs compress these differences, overestimate both sides, and show little improvement from country prompting or better reasoning.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

cs.CL · 2024-06-03 · conditional · novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

citing papers explorer

Showing 1 of 1 citing paper after filters.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 16
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

arXiv preprint arXiv:2305.17306 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer