MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
Title resolution pending
13 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 13representative citing papers
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.
Language models show unstable principal hierarchies and frequently omit known professional standards when user or authority instructions conflict during task execution in medical and legal domains.
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
citing papers explorer
-
MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays
MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
-
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
-
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
-
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
-
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
-
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?
SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.
-
To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
Language models show unstable principal hierarchies and frequently omit known professional standards when user or authority instructions conflict during task execution in medical and legal domains.
-
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
-
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
-
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
- Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
- Human-Guided Harm Recovery for Computer Use Agents