CoQA: A Conversational Question Answering Challenge
18 Pith papers cite this work.
representative citing papers
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
SHIFT reformulates neuron editing as learnable gate modulation on under 0.01% parameters to let LLMs adaptively balance contextual and parametric knowledge during RAG generation.
Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.
TN-gram replaces per-order hash tables in n-gram memory modules with a CP tensor factorization that shares token-position factors and uses order-absorption vectors, achieving comparable or better performance with fewer parameters.
Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.
Agentic GraphRAG constructs a Neo4j graph via deterministic structured ingestion plus LLM extraction from notices, then deploys modular agents with tool access and reflection to outperform vector-RAG baselines on Swiss commercial gazette data across entity resolution, answer quality, and multi-turn
DQA maintains persistent diagnostic state and aggregates retrievals at the root-cause level to reach 78.7% success on 150 enterprise IT scenarios versus 41.3% for standard multi-turn RAG while cutting average turns from 8.4 to 3.9.
Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.
A regression model using attention features and recurrent uncertainty scores improves selective generation in LLMs over unsupervised and supervised baselines on ten datasets and three models.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
citing papers explorer
- Evaluating Temporal Consistency in Multi-Turn Language Models
  Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
- How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
  A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
- ATIR: Towards Audio-Text Interleaved Contextual Retrieval
  Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
- SimDiff: Depth Pruning via Similarity and Difference
  SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
- SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation
  SHIFT reformulates neuron editing as learnable gate modulation on under 0.01% parameters to let LLMs adaptively balance contextual and parametric knowledge during RAG generation.
- Redesign Mixture-of-Experts Routers with Manifold Power Iteration
  Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.
- Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs
  TN-gram replaces per-order hash tables in n-gram memory modules with a CP tensor factorization that shares token-position factors and uses order-absorption vectors, achieving comparable or better performance with fewer parameters.
- Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
  Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.
- Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI
  Agentic GraphRAG constructs a Neo4j graph via deterministic structured ingestion plus LLM extraction from notices, then deploys modular agents with tool access and reflection to outperform vector-RAG baselines on Swiss commercial gazette data across entity resolution, answer quality, and multi-turn
- DQA: Diagnostic Question Answering for IT Support
  DQA maintains persistent diagnostic state and aggregates retrievals at the root-cause level to reach 78.7% success on 150 enterprise IT scenarios versus 41.3% for standard multi-turn RAG while cutting average turns from 8.4 to 3.9.
- Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
  Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.