Long-context LLMs Struggle with Long In-context Learning
27 Pith papers cite this work. Polarity classification is still indexing.
roles: background 4
polarities: background 4
representative citing papers
SafeTrans achieves up to 80% successful C-to-Rust translations via LLM iterative repair on 2653 programs and two real projects, with some C vulnerabilities carrying over to the Rust output.
citing papers explorer
- Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
  Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.
- Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization
  Step-TP is a dataset providing grounded, atomic step-level IR transitions and CoT supervision to enable reliable multi-step LLM-guided tensor program optimization instead of end-to-end imitation.
- Code Researcher: Deep Research Agent for Large Systems Code and Commit History
  Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.
- MicroAgent: Context-Augmented Multi-Agent Framework for Automatic Microservice Decomposition
  MicroAgent framework assigns five subtasks to specialized agents with multi-granularity context and analytical tools, achieving 89.2% average accuracy on 10 Java applications and beating prior methods by 24.6%.
- Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling
  LPES uses per-layer scaling factors optimized by a genetic algorithm with Bézier curves to balance attention and improve long-context LLM performance by up to 11.2% on key-value retrieval.
- MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
  Mindgames introduces a four-game evaluation platform for multi-agent LLM reasoning, runs a 944-agent competition, surfaces rule-adherence and error-survival limitations, and releases a 29k-game dataset with an offline scoring protocol.
- ERFSL: An Efficient Reward Function Searcher via Language Models for Custom-Environment Multi-Objective Optimization (Student Abstract)
  ERFSL generates and optimizes LLM-based reward functions for custom multi-objective RL, correcting codes in one iteration and converging weights in 5.2 iterations on average even from 500x errors.
- MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence
  MMCL-Bench shows that even the strongest frontier multimodal models solve fewer than one-third of tasks requiring recovery and application of visual rules, procedures, and empirical patterns.
- Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation
  Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.
- GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution
  GR-Evolve applies LLM-driven code evolution to global routing, reporting up to 8.72% post-detailed-routing wirelength reduction on seven benchmarks across three technology nodes.
- Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
  CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
- Learning to Adapt: In-Context Learning Beyond Stationarity
  Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
- Automated Profile Inference with Language Model Agents
  LLM agents can automatically infer identifiable and sensitive personal attributes from public activities on pseudonymous platforms with high effectiveness.
- KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification
  KG-HTC integrates knowledge graphs into LLMs via RAG to improve zero-shot hierarchical text classification performance on WoS, DBpedia, and Amazon datasets.
- Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement
  ERFSL uses LLMs to create per-requirement reward components, correct their code via a critic, and optimize weights with genetic-algorithm-style mutation and crossover driven by training logs, succeeding in a zero-shot data collection task.
- EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering
  EASE-TTT creates a soft attention target from evidence chunks to guide query-side test-time adaptation, yielding higher macro-average scores than full-context, retrieval-only, and standard qTTT baselines on six LongBench QA tasks.
- Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
  Memory-R2 proposes LoGo-GRPO to fix unfair trajectory comparisons in RL training of memory-augmented LLM agents by combining global end-to-end rewards with local rerollouts from identical memory states.
- LLMs with in-context learning for Algorithmic Theoretical Physics
  Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
- POPI: Personalizing LLMs via Optimized Natural Language Preference Inference
  POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context length by up to 10x on benchmarks.
- Retrieval-Augmented Generation with Graphs (GraphRAG)
  A survey proposing a holistic GraphRAG framework with components including query processor, retriever, organizer, generator, and data source, plus domain-tailored reviews, challenges, and future directions.
- MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers
  MATCH augments sparsified attention with an efficient in-context retrieval system to boost performance on long-range recall tasks in transformers.
- DD-GEPA: Prompt Optimization for Dialogue Disentanglement Focusing on Task Instruction and Utterance Representation
  DD-GEPA decomposes and optimizes prompts with GEPA for LLM-based dialogue disentanglement, reporting accuracy gains over baseline and hand-crafted prompts on benchmarks.
- Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison
  Headache specialists preferred their own literature summaries over those from Sonnet, GPT-4o, and Llama 3.1 in a blinded evaluation, though AI summaries were sometimes indistinguishable.
- Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
  A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
- Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
  A parallel compliance architecture using multi-stage LLM retrieval improves correctness and reasoning quality over a baseline for OT cybersecurity compliance queries in a railway case study.
- Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects
  A RAG-based virtual assistant was developed and evaluated to deliver accurate, context-specific responses for students navigating university project regulations.