AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.
hub
Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation.arXiv preprint arXiv:2512.06690.2025
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 13roles
baseline 1polarities
baseline 1representative citing papers
LLMs fail to capture embodied cognition and cultural variation in demonstrative use, unlike humans who show language-specific proximal-distal and perspective-taking patterns.
TriRec is a two-stage LLM-agent recommender that uses item self-promotion followed by platform-level sequential re-ranking to jointly optimize user utility, item exposure, and exposure fairness.
PAD-Rec augments standard draft models with item-position and step-position embeddings plus learnable gates, delivering up to 3.1x wall-clock speedup and 5% average gain over strong speculative-decoding baselines on four datasets while largely preserving recommendation quality.
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
Agent-GWO uses collaborative grey-wolf-inspired agents to jointly optimize LLM prompts and decoding settings, yielding higher accuracy and stability than prior single-agent prompt optimization methods on math and hybrid reasoning benchmarks.
FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
A hybrid graph-based training-free framework for LLM context compression matches strong baselines and shows larger gains on long-document benchmarks.
CAP-CoT uses iterative adversarial prompt cycles to improve CoT accuracy, stability, and robustness across six benchmarks and four LLM backbones.
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
AGSC combines NLI neutral probabilities for adaptive granularity with GMM semantic clustering to improve uncertainty quantification in long-text LLM generation, claiming SOTA factuality correlation and 60% faster inference.
LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
A small language model resolves semantic risks and conflicts in prompts via multi-perspective consistency checks, yielding a 2.5-point gain in LLM reasoning performance at $0.02 cost.
citing papers explorer
-
AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.
-
Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives
LLMs fail to capture embodied cognition and cultural variation in demonstrative use, unlike humans who show language-specific proximal-distal and perspective-taking patterns.
-
Breaking User-Centric Agency: A Tri-Party Framework for Agent-Based Recommendation
TriRec is a two-stage LLM-agent recommender that uses item self-promotion followed by platform-level sequential re-ranking to jointly optimize user utility, item exposure, and exposure fairness.
-
Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation
PAD-Rec augments standard draft models with item-position and step-position embeddings plus learnable gates, delivering up to 3.1x wall-clock speedup and 5% average gain over strong speculative-decoding baselines on four datasets while largely preserving recommendation quality.
-
Learning to Control Summaries with Score Ranking
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
-
Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models
Agent-GWO uses collaborative grey-wolf-inspired agents to jointly optimize LLM prompts and decoding settings, yielding higher accuracy and stability than prior single-agent prompt optimization methods on math and hybrid reasoning benchmarks.
-
FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning
FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
-
From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors
A hybrid graph-based training-free framework for LLM context compression matches strong baselines and shows larger gains on long-document benchmarks.
-
CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning
CAP-CoT uses iterative adversarial prompt cycles to improve CoT accuracy, stability, and robustness across six benchmarks and four LLM backbones.
-
Calibrating Model-Based Evaluation Metrics for Summarization
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
-
AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation
AGSC combines NLI neutral probabilities for adaptive granularity with GMM semantic clustering to improve uncertainty quantification in long-text LLM generation, claiming SOTA factuality correlation and 60% faster inference.
-
Medical Reasoning with Large Language Models: A Survey and MR-Bench
LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
-
Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt
A small language model resolves semantic risks and conflicts in prompts via multi-perspective consistency checks, yielding a 2.5-point gain in LLM reasoning performance at $0.02 cost.