NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.
International Conference on Learning Representations , year =
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7roles
background 2polarities
background 2representative citing papers
Evaluation artifacts substantially inflate the measured unsolvability ceiling in multi-LLM routing, leading to distorted router training and overstated headroom.
NeuroMAS reframes multi-agent language systems as neural architectures where LLM agents learn coordination via reinforcement learning rather than predefined roles.
The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
citing papers explorer
-
NARRA-Gym for Evaluating Interactive Narrative Agents
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.
-
Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts
Evaluation artifacts substantially inflate the measured unsolvability ceiling in multi-LLM routing, leading to distorted router training and overstated headroom.
-
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning
NeuroMAS reframes multi-agent language systems as neural architectures where LLM agents learn coordination via reinforcement learning rather than predefined roles.
-
The Generalized Turing Test: A Foundation for Comparing Intelligence
The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
-
Language models fail at extended rule following
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.
-
XekRung Technical Report
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
- PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning