Note on the sampling error of the difference between correlated proportions or percentages

Quinn McNemar · 1947 · Psychometrika · DOI 10.1007/bf02295996

17 Pith papers cite this work, alongside 3,350 external citations (Crossref). Polarity classification is still being indexed.

citation-role summary

background 2 · method 2

citation-polarity summary

background 2 · use method 2

representative citing papers

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Chart information is encoded but not routed to predictions in VLMs for claim verification, unlike tables, revealed by layer-wise probing and attention analysis on three models.

ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

LLM reasoning traces can be compiled into reusable symbolic solvers that achieve high accuracy on program synthesis benchmarks at zero inference cost and transfer to other domains.

How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews

cs.IR · 2026-04-30 · unverdicted · novelty 7.0

AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries (especially controversial ones), and are less consistent across repeated or slightly edited queries.

Evaluating Plan Compliance in Autonomous Programming Agents

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows. Standard plans and periodic reminders improve performance, but poor plans can degrade it more than having no plan.

Evaluating LLM Agents on Automated Software Analysis Tasks

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.

A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories

cs.SE · 2026-03-28 · unverdicted · novelty 7.0

A large-scale study of real-world repositories finds that AI-generated code differs from human-written code in complexity, structural traits, defect indicators, and commit-level activity patterns.

Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME

cs.CV · 2026-06-22 · conditional · novelty 6.0

Forced CoT produces video-dependent reasoning chains but does not improve MCQ accuracy on Qwen2.5-VL with Video-MME and causes a small drop on the 7B variant.

Honeyquest for LLMs: Rethinking Cyber Deception for AI Attackers

cs.CR · 2026-06-19 · unverdicted · novelty 6.0

LLMs fall for deceptive traps at higher rates than humans, lack the human attention-diversion effect, and exploit traps 73.4% of the time even after recognizing them in reasoning.

Spectral Vision Transformer for Efficient Tokenization with Limited Data

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A spectral vision transformer achieves comparable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.

SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response

cs.CR · 2026-05-06 · unverdicted · novelty 6.0

SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.

PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation

cs.RO · 2026-06-09 · unverdicted · novelty 5.0

PLUME jointly models parameter beliefs and conditioned dynamics in a latent space for dexterous manipulation, enabling zero-shot sim-to-real transfer that outperforms offline RL and behavior cloning baselines on turning, lifting, and flicking tasks.

When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

cs.CL · 2026-05-22 · unverdicted · novelty 5.0

LLMs reach moderate accuracy on a new psychiatric interview benchmark but systematically discount explicit symptoms when preserved functioning or protective factors are present.

Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

cs.CV · 2026-05-13 · unverdicted · novelty 5.0

TREX detects rectal cancer local regrowth from longitudinal endoscopy image pairs with 97% sensitivity and enables early prediction 3-12 months before clinical confirmation, outperforming baselines.

Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes

cs.IR · 2026-05-06 · unverdicted · novelty 5.0

Crowdsourced judgments reliably flag authentic videos but frequently miss manipulations and struggle to identify whether changes are audio-only, video-only, or both.

An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code

cs.SE · 2026-04-25 · unverdicted · novelty 4.0

Locally deployed LLMs achieve 43-45% accuracy on Python bug detection but frequently produce only partial identifications of problematic code regions.

citing papers explorer

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps · cs.AI · 2026-05-17 · unverdicted · none · ref 19
Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models · cs.SE · 2026-06-30 · unverdicted · none · ref 30
Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification · cs.CL · 2026-06-01 · unverdicted · none · ref 24
ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis · cs.CL · 2026-05-06 · unverdicted · none · ref 6
How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews · cs.IR · 2026-04-30 · unverdicted · none · ref 39
Evaluating Plan Compliance in Autonomous Programming Agents · cs.SE · 2026-04-13 · unverdicted · none · ref 21
Evaluating LLM Agents on Automated Software Analysis Tasks · cs.SE · 2026-04-13 · unverdicted · none · ref 41
A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories · cs.SE · 2026-03-28 · unverdicted · none · ref 24
Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME · cs.CV · 2026-06-22 · conditional · none · ref 21
Honeyquest for LLMs: Rethinking Cyber Deception for AI Attackers · cs.CR · 2026-06-19 · unverdicted · none · ref 34
Spectral Vision Transformer for Efficient Tokenization with Limited Data · cs.CV · 2026-05-12 · unverdicted · none · ref 55
SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response · cs.CR · 2026-05-06 · unverdicted · none · ref 24
PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation · cs.RO · 2026-06-09 · unverdicted · none · ref 44
When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening · cs.CL · 2026-05-22 · unverdicted · none · ref 2
Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy · cs.CV · 2026-05-13 · unverdicted · none · ref 55
Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes · cs.IR · 2026-05-06 · unverdicted · none · ref 72
An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code · cs.SE · 2026-04-25 · unverdicted · none · ref 20
