EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
hub
ACM Transactions on Information Systems , year =
15 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3representative citing papers
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
CACM improves language-based drug discovery agents by 36.4% via protocol auditing, a grounded diagnostician, and compressed static/dynamic/corrective memory channels that localize failures and bias corrections.
SLoD detects emergent scale boundaries in knowledge graphs by applying spectral heat diffusion to Poincare embeddings, recovering planted hierarchies in synthetic data and aligning with taxonomic depths in WordNet without resolution-parameter tuning.
ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.
DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API token cost.
APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
VizCopilot integrates topic modeling with document visualization to support user oversight of retrieved context in enterprise chatbots, enabling detection of misalignments and adaptation of prompting strategies.
citing papers explorer
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
EXG: Self-Evolving Agents with Experience Graphs
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
-
Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents
CACM improves language-based drug discovery agents by 36.4% via protocol auditing, a grounded diagnostician, and compressed static/dynamic/corrective memory channels that localize failures and bias corrections.
-
Semantic Level of Detail for Knowledge Graphs: Discovering Abstraction Boundaries via Spectral Heat Diffusion
SLoD detects emergent scale boundaries in knowledge graphs by applying spectral heat diffusion to Poincare embeddings, recovering planted hierarchies in synthetic data and aligning with taxonomic depths in WordNet without resolution-parameter tuning.
-
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.
-
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API token cost.
-
Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models
APG4RecSim automatically generates realistic user profiles for LLM-based recommendation simulations, outperforming manual baselines by up to 7% in nDCG@10 and 8% in JSD on three benchmark datasets.
-
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
-
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
Personalized soft prompts steer VLM attention to match user-specific gaze patterns, yielding better attention alignment and click prediction in recommendation simulations.
-
Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
Web2BigTable introduces a bi-level multi-agent system that achieves new state-of-the-art results on wide-coverage and deep web-to-table search benchmarks through orchestration, coordination, and closed-loop reflection.
-
Towards Self-Improving Error Diagnosis in Multi-Agent Systems
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
-
VizCopilot: Fostering Appropriate Reliance on Enterprise Chatbots with Context Visualization
VizCopilot integrates topic modeling with document visualization to support user oversight of retrieved context in enterprise chatbots, enabling detection of misalignments and adaptation of prompting strategies.
- Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval