The first benchmark for AI-generated scientific figure detection shows existing detectors fail in zero-shot transfer, overfit to specific generators, and break under common image corruptions.
Canonical reference
Title resolution pending
Canonical reference. 80% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 19representative citing papers
X-SYNTH synthesizes enterprise context from digital human attention using Digital Twin Signatures and seven attention filters, raising true lead rate from 9.5% to 61.9% while cutting false lead rate to 18.8%.
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
A co-creation process for inferring and refining personal strivings from computer activity logs yields more representative goals and higher user agency than baselines in a 14-person week-long study.
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.
A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
PiLLar is the first LLM-guided Monte-Carlo Tree Search framework for joint schema-value matching on pivot tables, achieving 87.94% average accuracy on a new benchmark PTbench derived from real-world domains.
EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
CHORUS multi-agent system reduced professional translation time by 33.8% while lowering cognitive effort and raising BLEU/COMET scores in a 30-participant within-subject study.
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.
A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.
SPRINT refines LLM-generated intents for session-based recommendation via a global intent pool, performance validation, selective LLM invocation during training, and a lightweight intent predictor for scalable inference without LLM calls.
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.
citing papers explorer
-
SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection
The first benchmark for AI-generated scientific figure detection shows existing detectors fail in zero-shot transfer, overfit to specific generators, and break under common image corruptions.
-
X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention
X-SYNTH synthesizes enterprise context from digital human attention using Digital Twin Signatures and seven attention filters, raising true lead rate from 9.5% to 61.9% while cutting false lead rate to 18.8%.
-
SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across multiple LLMs.
-
"What Are You Really Trying to Do?": Co-Creating Life Goals from Everyday Computer Use
A co-creation process for inferring and refining personal strivings from computer activity logs yields more representative goals and higher user agency than baselines in a 14-person week-long study.
-
Discovering Agentic Safety Specifications from 1-Bit Danger Signals
LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
-
SAGE: A Service Agent Graph-guided Evaluation Benchmark
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
-
Toward Executable Repository-Level Code Generation via Environment Alignment
EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.
-
optimize_anything: A Universal API for Optimizing any Text Parameter
A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
-
PiLLar: Matching for Pivot Table Schema via LLM-guided Monte-Carlo Tree Search
PiLLar is the first LLM-guided Monte-Carlo Tree Search framework for joint schema-value matching on pivot tables, achieving 87.94% average accuracy on a new benchmark PTbench derived from real-world domains.
-
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
-
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
-
CHORUS: Effort-Aware Multi-Agent Human-AI Collaboration for Professional Translation
CHORUS multi-agent system reduced professional translation time by 33.8% while lowering cognitive effort and raising BLEU/COMET scores in a 30-participant within-subject study.
-
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
-
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
-
Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.
-
Spec Kit Agents: Context-Grounded Agentic Workflows
A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.
-
SPRINT: Scalable and Predictive Intent Refinement for LLM-Enhanced Session-based Recommendation
SPRINT refines LLM-generated intents for session-based recommendation via a global intent pool, performance validation, selective LLM invocation during training, and a lightweight intent predictor for scalable inference without LLM calls.
-
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
-
A Survey of Scaling in Large Language Model Reasoning
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.