Reasoning model is stubborn: Diagnosing instruction overriding in reasoning models. arXiv preprint arXiv:2505.17225
3 Pith papers cite this work.
citing papers explorer
- Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
  Reasoning models judge better than non-reasoning LLMs yet retain biases; generating an evaluation plan first mitigates bias without losing accuracy.
- Context Learning for Multi-Agent Discussion
  M2CL trains per-agent context generators with a self-adaptive mechanism to maintain coherence and reduce output discrepancies in multi-LLM discussions, yielding 20-50% gains on reasoning, embodied, and mobile control tasks.
- Input-Time Scaling: Adding Noise and Irrelevance into Less-Is-More Drastically Improves Reasoning Performance and Efficiency
  Adding controlled noise and irrelevant persona contexts across training and testing stages for strong LLMs yields better reasoning and efficiency than high-quality data alone, reaching 76.7% on AIME24/25 with Qwen2.5-32B.