MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.
hub Canonical reference
Qwen3 technical report
Canonical reference. 75% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CosFlyTrack provides 12,000 expert UAV trajectories with aligned RGB, depth, segmentation, pose, target state, and bilingual instructions to train visual tracking agents, yielding 53-69 point gains in success rate after fine-tuning.
A unified measurement pipeline on 27 LLM checkpoints shows activation maxima spanning four orders of magnitude, with MoE models 14-23x lower than matched dense models and residual streams carrying the global max in most cases.
Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search to match or exceed supervised fine-tuning by 6.6 points on average in low-data settings across four datasets and six models.
PPol uses LLM-driven evolutionary program search to create diverse human-like user personas for simulators, yielding 33-62% fitness gains and +17% agent task success on retail and airline domains.
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
TIE derives a sinc-based interval encoding from Temporal Integrability and Duration Invariance principles, raising human-verified temporal constraint satisfaction from 77.34% to 96.03% while preserving visual quality in DiT models.
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdom benchmarks.
An automated fact-check-based pipeline for in-the-wild AI image data, when mixed with generator data in continual learning, lets detectors adapt to new generators while avoiding forgetting and delivers 8-9% accuracy gains on two existing models.
Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.
LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new BinDeObfBench.
GraphSSR introduces an adaptive SSR pipeline with SSR-SFT data synthesis and SSR-RL (Authenticity-Reinforced and Denoising-Reinforced stages) to overcome one-size-fits-all subgraph noise in zero-shot LLM graph reasoning.
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.
Federated PEFT on LLMs across healthcare and finance datasets performs close to centralized training and beats isolated local training under non-IID conditions.
GRIEF fuzzer finds 15 vulnerabilities including 2 CVEs in vLLM and SGLang by testing concurrent workloads for KV-cache isolation failures and cross-request interference.
Hint Tuning reduces token usage 24-66% (31.5% avg) in reasoning models via 1K self-annotated samples aligned to an instruct model's capabilities while keeping benchmark accuracy.
ξ-DPO rewrites the preference objective as minimizing distance to optimal margins and defines reward as a chosen-to-rejected ratio, yielding a bounded, interpretable margin ξ set directly from the initial reward-gap distribution.
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.
Prometheus reverse-engineers BDD-style executable specifications from bug failures and uses an RQA validation loop to achieve 93.97% correct patch rate on 680 Defects4J defects while rescuing 74.4% of bugs missed by strong baseline agents.
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
LERA is a retrieve-then-generate auction system that refines ad candidate ranking with LLM logits and applies a threshold-aware critical-value payment rule to maintain truthfulness in chatbot ad insertion.
citing papers explorer
No citing papers match the current filters.