AGIEval: A human-centric benchmark for evaluating foundation models
4 Pith papers cite this work.
citing papers explorer
- Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism
  ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
- MEASER: Malware embedding attacks on open-source LLMs
  MEASER embeds malware into open-source LLMs via parameter targeting and MAR-QIM modulation, achieving 0 BER and high stealth even after quantization and PEFT.
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models
  Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.
- Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
  Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.