EDEN releases the largest freely available Italian clinical notes corpus (4M notes, 6k annotated) and proposes CRF-filling as a structured extraction benchmark with zero-shot baselines from Gemma models.
Emotion Neurons
Canonical reference. 76% of citing Pith papers cite this work as background.
representative citing papers
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.
KV-cache sharing boosts multi-agent QA performance but enables undetectable tampering; HMAC manifests binding agent, session, and payload reliably detect changes.
Apparent psychological profiles of LLMs are largely measurement artifacts driven by directional response bias rather than actual traits.
ReproRepo uses GitHub issues as natural supervision to benchmark LLM agents on detecting reproducibility blockers across 1,149 ML papers, with the top agent finding related issues for roughly 90% of cases.
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
CAPER derives clause-aligned supervision via SQL AST counterfactuals to train a Clause-PRM that improves execution accuracy up to 15.3% relative and failure localization to 84.53% accuracy on BIRD and Spider.
CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.
EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
TASTE automates generation of high-coverage difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench where models saturating τ²-Bench drop sharply and unique tool combinations more than double.
PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer that improves SRCC across categories.
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited exploration in unconstrained systems.
ESamp trains a test-time distiller to model LLM depth-wise representation transitions and biases decoding toward high prediction-error paths to increase semantic diversity.
A dataset revealing high inter-designer disagreement on UI preferences motivates a sample-efficient method that personalizes generative interfaces by embedding new users in the space of prior designers, outperforming baselines in both modeling and user preference.
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.
The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.
VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
CausalReasoningBenchmark supplies 173 real-world queries that separately grade causal identification specifications and point estimates to expose distinct failure modes in automated causal systems.
citing papers explorer
- Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
  Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.
- From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
  SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
- Evaluation-driven Scaling for Scientific Discovery
  SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
- Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
  Multilingual LLMs exhibit US-centric global bias and population-size intra-lingual bias on locale-ambiguous questions, with the global bias stronger after instruction tuning.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
  Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- Evaluation of Agents under Simulated AI Marketplace Dynamics
  Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
- ReflectCAP: Detailed Image Captioning with Reflective Memory
  ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
- DeepSlide: From Artifacts to Presentation Delivery
  DeepSlide introduces a multi-agent system for full presentation preparation that matches baselines on slide quality but improves narrative flow, pacing, and script synergy via a new dual-scoreboard benchmark.
- Agents of Chaos
  An exploratory red-teaming study documents eleven cases of security, privacy, and governance failures in autonomous language-model agents with tool access and persistent memory.
- Adaptive Autoguidance for Item-Side Fairness in Diffusion Recommender Systems
  A2G-DiffRec applies adaptive autoguidance in diffusion recommenders, learning to balance main and weak model outputs via fairness-aware regularization to improve item exposure fairness with only marginal accuracy loss.
- GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference
  GRACE-MoE integrates expert grouping, dynamic replication, and locality-aware routing with hierarchical sparse communication to reduce end-to-end latency in distributed SMoE inference.
- Learning from Natural Language Feedback for Personalized Question Answering
  VAC replaces scalar rewards with natural language feedback in an alternating training loop between a feedback model and a policy model, yielding better personalized QA on the LaMP-QA benchmark.
- Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation
  KG-CFR decouples planning from execution via knowledge-grounded counterfactual reasoning, preventing critical degradation in over 95% of perturbed runs and raising argument quality from 0.694 to 0.822 in a 1v1v1 simulation.
- Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations
  Empirical study claiming to be the first broad comparison of chunking methods in RAG, highlighting effectiveness, cost, and generalization limitations across scenarios.
- User-Aware Active Knowledge Acquisition for Emotional Support Dialogue
  UKA is a gradient-free active dialogue learning framework using Theory-of-Mind uncertainty estimation to acquire user-aligned conversational knowledge, outperforming baselines in dialogue quality and user alignment across benchmarks.
- A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test
  Proposes a minimum measurement standard for LLM-as-a-judge in multi-hop RAG that fixes budgets and requires cluster-aware inference, showing it alters which baseline comparisons remain significant.
- Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions
  A lifecycle-based survey of LLM fine-tuning security that reviews attacks and defenses by intervention phase and reports unified empirical findings on model-dependent attack effectiveness and limited defense generalization.
- When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
  Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.
- CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation
  CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
- Quantum Knowledge Graph: Modeling Context-Dependent Triplet Validity
  Quantum Knowledge Graphs model context-dependent triplet validity and improve LLM medical reasoning accuracy by 1.4 to 6 percentage points over baselines.
- Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
  Nemotron 3 Super is an open 120B hybrid Mamba-Attention MoE model with new LatentMoE architecture and MTP layers that matches accuracy of similar models while delivering up to 7.5x higher inference throughput.
- Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
  Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
- Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation
  SDSR places human metadata at file primacy and combines it with prompt routing rules to reach 100% primary category accuracy on a 119-category benchmark, far above the 65% no-guidance baseline.
- When the Chain Breaks: Interactive Diagnosis of LLM Chain-of-Thought Reasoning Errors
  ReasonDiag combines automated error detection with interactive visualizations to help users identify and diagnose errors in LLM chain-of-thought reasoning traces.
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
  The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
- A Comparative Study on Affective Cues in Text Embeddings Across Psychological Emotion Theories
  Open-weight instruction-aware encoders capture equal or greater affective information than proprietary models at word level across emotion theories, while task-tuned and proprietary encoders perform best on sentence-level classification.
- Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey
  Ada is a scoped apparatus that records SWE-agent trajectories in real repositories and applies observation lenses to project navigation, evidence selection, synthesis, grounding, and stopping behaviors across 408 runs.
- What Am I Missing? Question-Answering as Hidden State Probing
  Question generation produces a hidden-state signal that predicts final correctness before the answer is produced, yet gating interventions based on that signal do not reliably improve trajectories.
- Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
  QuestBench is a student-constructed benchmark of 256 questions on which current deep research AI systems achieve a mean pass rate of 16.85% and a best-case rate of 57.58%.
- A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models
  Attention heads exhibit negative higher-order synergy (negative triple dividends), allowing pruning of redundant heads that cuts FLOPs by ~18% with only small perplexity increase.
- Building Persona-Based Agents On Demand: Tailoring Multi-Agent Workflows to User Needs
  On-demand runtime generation of persona-based agents can enable personalized multi-agent AI workflows beyond fixed hard-coded architectures.
- Is it Cake or is it AI? A Systematic Review of Human Uncertainty in Distinguishing Generative Artificial Intelligence Content
  Humans perform at chance levels when distinguishing generative AI content from human content in text, images, and voice.
- Predicting one-year clinical instability and mortality in heart failure patients using sequence modeling
  Sequence models on EHR data from a Swedish heart failure cohort achieve AUPRCs of 0.555 to 0.854 for one-year instability and mortality predictions and support four care pathways.
- Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator
  The paper proposes a four-role framework for LLMs in scientific innovation and reviews methods, benchmarks, and limitations across Assistant, Collaborator, Scientist, and Evaluator roles.
- Reasoning Beyond Prediction: From Data-Driven to Causal Software Engineering
  Calls for a new paradigm in software engineering where machines support causal reasoning rather than only prediction from data patterns.
- It's Complicated: On the Design and Evaluation of AI-Powered AAC Interfaces
  This paper discusses design challenges and evaluation difficulties for AI-powered AAC interfaces, proposing more robust methods that incorporate users' multifaceted and intersectional needs.
- Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation
  A synthesis of 247 papers on LLM agent security identifies prompt injection and tool hijacking as dominant threats, notes weakly compositional defenses, and argues for trust boundaries and realistic evaluations.
- Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation
  A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.
- Overview of HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts
  HIPE-2026 is an evaluation campaign with 17 teams testing relation extraction for person presence at locations in 19th-20th century newspapers across French, German, and English plus a literary generalization set.
- Exploring LLM Agent Designs and Interaction Modalities for Scientific Visualization
- Topology-Aware LLM-Driven Social Simulation: A Unified Framework for Efficient and Realistic Agent Dynamics
- Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook