MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
hub Canonical reference
Qwen3 technical report
Canonical reference. 75% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CosFlyTrack provides 12,000 expert UAV trajectories with aligned RGB, depth, segmentation, pose, target state, and bilingual instructions to train visual tracking agents, yielding 53-69 point gains in success rate after fine-tuning.
Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search to match or exceed supervised fine-tuning by 6.6 points on average in low-data settings across four datasets and six models.
PPol uses LLM-driven evolutionary program search to create diverse human-like user personas for simulators, yielding 33-62% fitness gains and +17% agent task success on retail and airline domains.
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdom benchmarks.
An automated fact-check-based pipeline for in-the-wild AI image data, when mixed with generator data in continual learning, lets detectors adapt to new generators while avoiding forgetting and delivers 8-9% accuracy gains on two existing models.
Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.
LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new BinDeObfBench.
GraphSSR introduces an adaptive SSR pipeline with SSR-SFT data synthesis and SSR-RL (Authenticity-Reinforced and Denoising-Reinforced stages) to overcome one-size-fits-all subgraph noise in zero-shot LLM graph reasoning.
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.
Federated PEFT on LLMs across healthcare and finance datasets performs close to centralized training and beats isolated local training under non-IID conditions.
GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.
ξ-DPO rewrites the preference objective as minimizing distance to optimal margins and defines reward as a chosen-to-rejected ratio, yielding a bounded, interpretable margin ξ set directly from the initial reward-gap distribution.
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.
Prometheus reverse-engineers BDD-style executable specifications from bug failures and uses an RQA validation loop to achieve 93.97% correct patch rate on 680 Defects4J defects while rescuing 74.4% of bugs missed by strong baseline agents.
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
LERA is a retrieve-then-generate auction system that refines ad candidate ranking with LLM logits and applies a threshold-aware critical-value payment rule to maintain truthfulness in chatbot ad insertion.
Squeeze Evolve is a multi-model orchestration framework that improves efficiency and performance in verifier-free evolutionary inference, cutting costs up to 3x while matching verifier-based methods on several benchmarks.
CoLLM-NAS introduces a collaborative two-LLM framework with Navigator, Generator, and Coordinator modules to perform knowledge-guided neural architecture search, reporting state-of-the-art results on ImageNet and NAS-Bench-201 with 4-10x lower search cost.
The paper organizes repository-level retrieval-augmented code generation into a unified framework covering retrieval substrate, control regime, and evaluation setting while summarizing strategies, datasets, and challenges.
citing papers explorer
-
MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents
MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
-
CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization
CosFlyTrack provides 12,000 expert UAV trajectories with aligned RGB, depth, segmentation, pose, target state, and bilingual instructions to train visual tracking agents, yielding 53-69 point gains in success rate after fine-tuning.
-
Dynamic Latent Routing
Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search to match or exceed supervised fine-tuning by 6.6 points on average in low-data settings across four datasets and six models.
-
Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
PPol uses LLM-driven evolutionary program search to create diverse human-like user personas for simulators, yielding 33-62% fitness gains and +17% agent task success on retail and airline domains.
-
From Web to Pixels: Bringing Agentic Search into Visual Perception
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
-
SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild
SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdom benchmarks.
-
Automated In-the-Wild Data Collection for Continual AI Generated Image Detection
An automated fact-check-based pipeline for in-the-wild AI image data, when mixed with generator data in continual learning, lets detectors adapt to new generators while avoiding forgetting and delivers 8-9% accuracy gains on two existing models.
-
Interactive Episodic Memory with User Feedback
Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.
-
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new BinDeObfBench.
-
Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models
GraphSSR introduces an adaptive SSR pipeline with SSR-SFT data synthesis and SSR-RL (Authenticity-Reinforced and Denoising-Reinforced stages) to overcome one-size-fits-all subgraph noise in zero-shot LLM graph reasoning.
-
ReactiveGWM: Steering NPC in Reactive Game World Models
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
-
Stateful Reasoning via Insight Replay
InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.
-
Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning
Federated PEFT on LLMs across healthcare and finance datasets performs close to centralized training and beats isolated local training under non-IID conditions.
-
Continuous Discovery of Vulnerabilities in LLM Serving Systems with Fuzzing
GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.
-
$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin
ξ-DPO rewrites the preference objective as minimizing distance to optimal margins and defines reward as a chosen-to-rejected ratio, yielding a bounded, interpretable margin ξ set directly from the initial reward-gap distribution.
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.
-
Project Prometheus: Bridging the Intent Gap in Agentic Program Repair via Reverse-Engineered Executable Specifications
Prometheus reverse-engineers BDD-style executable specifications from bug failures and uses an RQA validation loop to achieve 93.97% correct patch rate on 680 Defects4J defects while rescuing 74.4% of bugs missed by strong baseline agents.
-
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
-
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
-
LERA: LLM-Enhanced RAG for Ad Auction in Generative Chatbots
LERA is a retrieve-then-generate auction system that refines ad candidate ranking with LLM logits and applies a threshold-aware critical-value payment rule to maintain truthfulness in chatbot ad insertion.
-
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Squeeze Evolve is a multi-model orchestration framework that improves efficiency and performance in verifier-free evolutionary inference, cutting costs up to 3x while matching verifier-based methods on several benchmarks.
-
CoLLM-NAS: Collaborative Large Language Models for Efficient Knowledge-Guided Neural Architecture Search
CoLLM-NAS introduces a collaborative two-LLM framework with Navigator, Generator, and Coordinator modules to perform knowledge-guided neural architecture search, reporting state-of-the-art results on ImageNet and NAS-Bench-201 with 4-10x lower search cost.
-
Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches
The paper organizes repository-level retrieval-augmented code generation into a unified framework covering retrieval substrate, control regime, and evaluation setting while summarizing strategies, datasets, and challenges.
- Measuring Maximum Activations in Open Large Language Models
- TIE: Time Interval Encoding for Video Generation over Events
- Hint Tuning: Less Data Makes Better Reasoners