The Innovation 7(6), 101253 (2026)

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. · 2024 · arXiv 2025.101253

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Instance-Optimal Estimation with Multiple LLM Judges on a Budget

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.

SASAV: Self-Directed Agent for Scientific Analysis and Visualization

cs.GR · 2026-04-03 · unverdicted · novelty 7.0

SASAV introduces the first fully autonomous multi-agent system for scientific data analysis and visualization that operates without external prompting or human-in-the-loop feedback.

Civil Court Simulation with Large Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Multi-agent LLM framework simulates Chinese civil trials through five-stage procedures with memory and retrieval, producing judgments strong in liability allocation and multi-item decisions.

DN-Hypo-Pipeline: An AI-Driven Workflow for Generating Hypotheses using Large Language Models and Scientific Explanations

cs.AI · 2026-06-07 · unverdicted · novelty 6.0

DN-Hypo-Pipeline operationalizes three philosophy-of-science accounts to direct LLMs toward principle-based hypothesis generation, claims superior performance over direct prompting, and derives two new transformer algorithms from the resulting hypotheses.

The AI Epistemic Deference Index: A Continuous Measure of Sycophancy

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

Proposes the AI Epistemic Deference Index (AEDI) as a continuous measure of epistemic sycophancy and applies it to eight models using a new LLM-judge protocol on 500 propositions and 16,000 prompts.

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

cs.CL · 2026-06-05 · unverdicted · novelty 6.0

A new evaluation framework shows that even the best tested LLM only reliably adjusts response complexity in the intended direction 46% of the time across 98 scientific queries.

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

POIROT protocol repurposes agents in LLM multi-agent systems as an internal diagnostic layer for failure detection, outperforming single-LLM evaluators with gains that increase with complexity, agent count, and fault types.

"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents

cs.CR · 2026-05-30 · unverdicted · novelty 6.0

New benchmark Scammer4U finds 54-93% critical PII leakage from frontier web agents on scam sites versus 0% on benign twins, plus a 30-point gap between verbalized suspicion and actual submission.

Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks

cs.CL · 2026-05-05 · unverdicted · novelty 6.0

Introduces a clean matched benchmark and Dynamic Emotional Signature Graphs (DESG) framework that detects implicit sycophancy via clinical-state transitions and reports a 0.0488 macro-F1 gain over baselines on harmful-risk detection.

Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI

cs.IR · 2026-04-15 · unverdicted · novelty 6.0

Agentic GraphRAG constructs a Neo4j graph via deterministic structured ingestion plus LLM extraction from notices, then deploys modular agents with tool access and reflection to outperform vector-RAG baselines on Swiss commercial gazette data across entity resolution, answer quality, and multi-turn

RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

RPA-Check is a new multi-stage framework using dimension definition, boolean checklist augmentation, semantic filtering, and LLM-as-judge verification to assess role-playing agents, with tests on a legal training game showing smaller instruction-tuned models can be more consistent than larger ones.

Bash-Commenter: Leveraging Syntax-Aware Preference Optimization to Reinforce Large Language Model for Bash Code Comment Generation

cs.SE · 2026-06-29 · unverdicted · novelty 5.0

Bash-Commenter applies CPT, SFT, and Syntax-Aware Preference Optimization (SAPO) via AST atomic operations to LLaMA-3.1-8B, reporting higher BLEU-4/METEOR/ROUGE-L scores than baselines on single-line and multi-line Bash comment generation tasks.

A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training

cs.CL · 2026-06-26 · unverdicted · novelty 5.0

Introduces a French OSCE dialogue dataset of 240 interactions and a modular LLM-based controllable virtual patient generation system with multi-level LLM-as-Judge evaluation for clinical skills training.

POLARIS: Guiding Small Models to Write Long Stories

cs.CL · 2026-06-02 · unverdicted · novelty 5.0

POLARIS trains Qwen3.5-9B via GRPO with LLM-as-judge rewards and human-reference injection, yielding a model competitive with larger open-weight models on length adherence and quality, including generalization to 3x training length.

NEURON: A Neuro-symbolic System for Grounded Clinical Explainability

cs.AI · 2026-05-02 · unverdicted · novelty 5.0 · 2 refs

NEURON integrates SNOMED CT, ML, and RAG LLM to raise AUC from 0.74-0.77 to 0.84-0.88 and human-aligned explainability scores from 0.50 to 0.85 on MIMIC-IV acute heart failure data.

AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows

cs.IR · 2026-01-19 · unverdicted · novelty 4.0

RAG-based LLM extraction reaches 89% accuracy on clinical trial protocols versus 62.6% for standalone models and cuts simulated workflow time by 40%.

citing papers explorer

Instance-Optimal Estimation with Multiple LLM Judges on a Budget · cs.LG · 2026-05-22 · unverdicted · none · ref 5
SASAV: Self-Directed Agent for Scientific Analysis and Visualization · cs.GR · 2026-04-03 · unverdicted · none · ref 17
Civil Court Simulation with Large Language Models · cs.CL · 2026-06-08 · unverdicted · none · ref 10
DN-Hypo-Pipeline: An AI-Driven Workflow for Generating Hypotheses using Large Language Models and Scientific Explanations · cs.AI · 2026-06-07 · unverdicted · none · ref 46
The AI Epistemic Deference Index: A Continuous Measure of Sycophancy · cs.AI · 2026-06-05 · unverdicted · none · ref 2
Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses · cs.CL · 2026-06-05 · unverdicted · none · ref 63
POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems · cs.AI · 2026-06-01 · unverdicted · none · ref 19
"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents · cs.CR · 2026-05-30 · unverdicted · none · ref 132
Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks · cs.CL · 2026-05-05 · unverdicted · none · ref 9
Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI · cs.IR · 2026-04-15 · unverdicted · none · ref 9
RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents · cs.CL · 2026-04-13 · unverdicted · none · ref 9
Bash-Commenter: Leveraging Syntax-Aware Preference Optimization to Reinforce Large Language Model for Bash Code Comment Generation · cs.SE · 2026-06-29 · unverdicted · none · ref 12
A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training · cs.CL · 2026-06-26 · unverdicted · none · ref 2
POLARIS: Guiding Small Models to Write Long Stories · cs.CL · 2026-06-02 · unverdicted · none · ref 36
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability · cs.AI · 2026-05-02 · unverdicted · none · ref 36 · 2 links
AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows · cs.IR · 2026-01-19 · unverdicted · none · ref 25
