AGIEval: A human-centric benchmark for evaluating foundation models

Zhong, W · 2024 · DOI 10.18653/v1/2024.findings-naacl.149

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

cs.LO · 2026-04-07 · unverdicted · novelty 7.0

ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

MEASER: Malware embedding attacks on open-source LLMs

cs.CR · 2025-10-12 · unverdicted · novelty 6.0

MEASER embeds malware into open-source LLMs via parameter targeting and MAR-QIM modulation, achieving 0 BER and high stealth even after quantization and PEFT.

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

cs.CL · 2024-10-23 · conditional · novelty 6.0

Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

citing papers explorer

Showing 4 of 4 citing papers.

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

cs.LO · 2026-04-07 · unverdicted · none · ref 124

ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

MEASER: Malware embedding attacks on open-source LLMs

cs.CR · 2025-10-12 · unverdicted · none · ref 44

MEASER embeds malware into open-source LLMs via parameter targeting and MAR-QIM modulation, achieving 0 BER and high stealth even after quantization and PEFT.

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

cs.CL · 2024-10-23 · conditional · none · ref 207

Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

cs.CL · 2025-02-05 · unverdicted · none · ref 264

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.
