Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
The Innovation7(6), 101253 (2026)
16 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 16verdicts
UNVERDICTED 16representative citing papers
SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.
Multi-agent LLM framework simulates Chinese civil trials through five-stage procedures with memory and retrieval, producing judgments strong in liability allocation and multi-item decisions.
DN-Hypo-Pipeline operationalizes three philosophy-of-science accounts to direct LLMs toward principle-based hypothesis generation, claims superior performance over direct prompting, and derives two new transformer algorithms from the resulting hypotheses.
Proposes the AI Epistemic Deference Index (AEDI) as a continuous measure of epistemic sycophancy and applies it to eight models using a new LLM-judge protocol on 500 propositions and 16,000 prompts.
A new evaluation framework shows that even the best tested LLM only reliably adjusts response complexity in the intended direction 46% of the time across 98 scientific queries.
POIROT protocol repurposes agents in LLM multi-agent systems as an internal diagnostic layer for failure detection, outperforming single-LLM evaluators with gains that increase with complexity, agent count, and fault types.
New benchmark Scammer4U finds 54-93% critical PII leakage from frontier web agents on scam sites versus 0% on benign twins, plus a 30-point gap between verbalized suspicion and actual submission.
Introduces a clean matched benchmark and Dynamic Emotional Signature Graphs (DESG) framework that detects implicit sycophancy via clinical-state transitions and reports a 0.0488 macro-F1 gain over baselines on harmful-risk detection.
Agentic GraphRAG constructs a Neo4j graph via deterministic structured ingestion plus LLM extraction from notices, then deploys modular agents with tool access and reflection to outperform vector-RAG baselines on Swiss commercial gazette data across entity resolution, answer quality, and multi-turn
RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game showing smaller instruction-tuned models can be more consistent than larger ones.
Bash-Commenter applies CPT, SFT, and Syntax-Aware Preference Optimization (SAPO) via AST atomic operations to LLaMA-3.1-8B, reporting higher BLEU-4/METEOR/ROUGE-L scores than baselines on single-line and multi-line Bash comment generation tasks.
Introduces a French OSCE dialogue dataset of 240 interactions and a modular LLM-based controllable virtual patient generation system with multi-level LLM-as-Judge evaluation for clinical skills training.
POLARIS trains Qwen3.5-9B via GRPO with LLM-as-judge rewards and human-reference injection, yielding a model competitive with larger open-weight models on length adherence and quality, including generalization to 3x training length.
NEURON integrates SNOMED CT, ML, and RAG LLM to raise AUC from 0.74-0.77 to 0.84-0.88 and human-aligned explainability scores from 0.50 to 0.85 on MIMIC-IV acute heart failure data.
RAG-based LLM extraction reaches 89% accuracy on clinical trial protocols versus 62.6% for standalone models and cuts simulated workflow time by 40%.
citing papers explorer
-
Instance-Optimal Estimation with Multiple LLM Judges on a Budget
Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
-
SASAV: Self-Directed Agent for Scientific Analysis and Visualization
SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.
-
Civil Court Simulation with Large Language Models
Multi-agent LLM framework simulates Chinese civil trials through five-stage procedures with memory and retrieval, producing judgments strong in liability allocation and multi-item decisions.
-
DN-Hypo-Pipeline: An AI-Driven Workflow for Generating Hypotheses using Large Language Models and Scientific Explanations
DN-Hypo-Pipeline operationalizes three philosophy-of-science accounts to direct LLMs toward principle-based hypothesis generation, claims superior performance over direct prompting, and derives two new transformer algorithms from the resulting hypotheses.
-
The AI Epistemic Deference Index: A Continuous Measure of Sycophancy
Proposes the AI Epistemic Deference Index (AEDI) as a continuous measure of epistemic sycophancy and applies it to eight models using a new LLM-judge protocol on 500 propositions and 16,000 prompts.
-
Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses
A new evaluation framework shows that even the best tested LLM only reliably adjusts response complexity in the intended direction 46% of the time across 98 scientific queries.
-
POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems
POIROT protocol repurposes agents in LLM multi-agent systems as an internal diagnostic layer for failure detection, outperforming single-LLM evaluators with gains that increase with complexity, agent count, and fault types.
-
"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents
New benchmark Scammer4U finds 54-93% critical PII leakage from frontier web agents on scam sites versus 0% on benign twins, plus a 30-point gap between verbalized suspicion and actual submission.
-
Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks
Introduces a clean matched benchmark and Dynamic Emotional Signature Graphs (DESG) framework that detects implicit sycophancy via clinical-state transitions and reports a 0.0488 macro-F1 gain over baselines on harmful-risk detection.
-
Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI
Agentic GraphRAG constructs a Neo4j graph via deterministic structured ingestion plus LLM extraction from notices, then deploys modular agents with tool access and reflection to outperform vector-RAG baselines on Swiss commercial gazette data across entity resolution, answer quality, and multi-turn
-
RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents
RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game showing smaller instruction-tuned models can be more consistent than larger ones.
-
Bash-Commenter: Leveraging Syntax-Aware Preference Optimization to Reinforce Large Language Model for Bash Code Comment Generation
Bash-Commenter applies CPT, SFT, and Syntax-Aware Preference Optimization (SAPO) via AST atomic operations to LLaMA-3.1-8B, reporting higher BLEU-4/METEOR/ROUGE-L scores than baselines on single-line and multi-line Bash comment generation tasks.
-
A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training
Introduces a French OSCE dialogue dataset of 240 interactions and a modular LLM-based controllable virtual patient generation system with multi-level LLM-as-Judge evaluation for clinical skills training.
-
POLARIS: Guiding Small Models to Write Long Stories
POLARIS trains Qwen3.5-9B via GRPO with LLM-as-judge rewards and human-reference injection, yielding a model competitive with larger open-weight models on length adherence and quality, including generalization to 3x training length.
-
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
NEURON integrates SNOMED CT, ML, and RAG LLM to raise AUC from 0.74-0.77 to 0.84-0.88 and human-aligned explainability scores from 0.50 to 0.85 on MIMIC-IV acute heart failure data.
-
AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows
RAG-based LLM extraction reaches 89% accuracy on clinical trial protocols versus 62.6% for standalone models and cuts simulated workflow time by 40%.