Cornetto is the first benchmark that synthesizes 231 network misconfiguration problems across topologies of 20-754 nodes and uses formal verification to show that nine state-of-the-art LLMs often introduce regressions and degrade at scale.
hub
React: Synergizing reasoning and acting in language models
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dominated by retrieval-scope drift.
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy loss.
GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.
PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
Testing 18 LLMs found 94.4% vulnerable to direct prompt injection for malware installation, 83.3% to RAG backdoor attacks, and 100% to inter-agent trust exploitation in multi-agent systems.
AgentPex extracts rules from prompts and automatically flags specification violations in agent execution traces that outcome-only benchmarks miss.
A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.
AgenticAITA proposes a training-free multi-agent LLM framework for autonomous trading using a deliberative pipeline, Z-score triggers, and safety gates, shown to run correctly in a five-day live dry-run with 157 invocations.
ReflectiChain uses latent trajectory rehearsal and retrospective agentic RL inside an LLM world model to raise average step rewards by 250% and restore supply-chain operability from 13.3% to 88.5% on the Semi-Sim benchmark under extreme shocks.
citing papers explorer
-
Benchmarking LLM-Driven Network Configuration Repair
Cornetto is the first benchmark that synthesizes 231 network misconfiguration problems across topologies of 20-754 nodes and uses formal verification to show that nine state-of-the-art LLMs often introduce regressions and degrade at scale.
-
SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
SGR-Bench evaluates agentic LLM systems on state-gated retrieval tasks where evidence is only accessible after configuring site-specific states, with the strongest system reaching 66.18% item-level F1 and failures dominated by retrieval-scope drift.
-
Harnessing LLM Agents with Skill Programs
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
-
STS: Efficient Sparse Attention with Speculative Token Sparsity
STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy loss.
-
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.
-
PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy
PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.
-
RAGEN-2: Reasoning Collapse in Agentic RL
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
-
The Dark Side of LLMs: Agent-based Attack Vectors for System-level Compromise
Testing 18 LLMs found 94.4% vulnerable to direct prompt injection for malware installation, 83.3% to RAG backdoor attacks, and 100% to inter-agent trust exploitation in multi-agent systems.
-
Willful Disobedience: Automatically Detecting Failures in Agentic Traces
AgentPex extracts rules from prompts and automatically flags specification violations in agent execution traces that outcome-only benchmarks miss.
-
Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation
A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.
-
AgenticAITA: A Proof-Of-Concept About Deliberative Multi-Agent Reasoning for Autonomous Trading Systems
AgenticAITA proposes a training-free multi-agent LLM framework for autonomous trading using a deliberative pipeline, Z-score triggers, and safety gates, shown to run correctly in a five-day live dry-run with 157 invocations.
-
From Topology to Trajectory: LLM-Driven World Models For Supply Chain Resilience
ReflectiChain uses latent trajectory rehearsal and retrospective agentic RL inside an LLM world model to raise average step rewards by 250% and restore supply-chain operability from 13.3% to 88.5% on the Semi-Sim benchmark under extreme shocks.