Title resolution pending

OpenAI , title = · 2025

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

cs.CV · 2026-05-15 · conditional · novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.

Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

cs.SE · 2026-05-21 · unverdicted · novelty 6.0

SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Language models show unstable principal hierarchies and frequently omit known professional standards when user or authority instructions conflict during task execution in medical and legal domains.

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.

On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference

cs.CR · 2026-05-06 · conditional · novelty 6.0

An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

cs.CL · 2026-04-22

Human-Guided Harm Recovery for Computer Use Agents

cs.AI · 2026-04-20

citing papers explorer

Showing 13 of 13 citing papers.

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays cs.CV · 2026-05-15 · conditional · none · ref 41
MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning cs.CV · 2026-05-21 · unverdicted · none · ref 51
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents cs.CL · 2026-05-21 · unverdicted · none · ref 21
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain cs.CL · 2026-05-09 · unverdicted · none · ref 11
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving cs.CL · 2026-04-23 · unverdicted · none · ref 78
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering? cs.SE · 2026-05-21 · unverdicted · none · ref 51
SWE-Mutation benchmark shows current LLMs achieve low verification (10.20%) and detection (36.15%) rates on 2,636 mutated variants, exposing weaknesses in generating reliable test suites.
To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands cs.AI · 2026-05-12 · unverdicted · none · ref 41
Language models show unstable principal hierarchies and frequently omit known professional standards when user or authority instructions conflict during task execution in medical and legal domains.
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks cs.LG · 2026-05-11 · unverdicted · none · ref 9
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference cs.CR · 2026-05-06 · conditional · none · ref 18
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance cs.CL · 2026-05-21 · unverdicted · none · ref 41
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
Measuring AI Reasoning: A Guide for Researchers cs.AI · 2026-05-04 · unverdicted · none · ref 62
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving cs.CL · 2026-04-22 · unreviewed · ref 62
Human-Guided Harm Recovery for Computer Use Agents cs.AI · 2026-04-20 · unreviewed · ref 31

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer