Long-context LLMs struggle with long in-context learning

Long-context LLMs struggle with long in-context learning · 2024 · arXiv 2404.02060

27 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.

Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Step-TP is a dataset providing grounded, atomic step-level IR transitions and CoT supervision to enable reliable multi-step LLM-guided tensor program optimization instead of end-to-end imitation.

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

cs.SE · 2025-05-27 · unverdicted · novelty 7.0

Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.

MicroAgent: Context-Augmented Multi-Agent Framework for Automatic Microservice Decomposition

cs.SE · 2026-06-29 · unverdicted · novelty 6.0

MicroAgent framework assigns five subtasks to specialized agents with multi-granularity context and analytical tools, achieving 89.2% average accuracy on 10 Java applications and beating prior methods by 24.6%.

Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

LPES uses per-layer scaling factors optimized by a genetic algorithm with Bézier curves to balance attention and improve long-context LLM performance by up to 11.2% on key-value retrieval.

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Mindgames introduces a four-game evaluation platform for multi-agent LLM reasoning, runs a 944-agent competition, surfaces rule-adherence and error-survival limitations, and releases a 29k-game dataset with an offline scoring protocol.

ERFSL: An Efficient Reward Function Searcher via Language Models for Custom-Environment Multi-Objective Optimization (Student Abstract)

eess.SY · 2026-05-19 · unverdicted · novelty 6.0

ERFSL generates and optimizes LLM-based reward functions for custom multi-objective RL, correcting codes in one iteration and converging weights in 5.2 iterations on average even from 500x errors.

MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

MMCL-Bench shows that even the strongest frontier multimodal models solve fewer than one-third of tasks requiring recovery and application of visual rules, procedures, and empirical patterns.

Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation

cs.CR · 2026-04-24 · unverdicted · novelty 6.0

Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.

GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution

cs.AR · 2026-04-24 · unverdicted · novelty 6.0

GR-Evolve applies LLM-driven code evolution to global routing, reporting up to 8.72% post-detailed-routing wirelength reduction on seven benchmarks across three technology nodes.

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

cs.IR · 2026-04-20 · unverdicted · novelty 6.0

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

Learning to Adapt: In-Context Learning Beyond Stationarity

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.

Automated Profile Inference with Language Model Agents

cs.CR · 2025-05-18 · unverdicted · novelty 6.0

LLM agents can automatically infer identifiable and sensitive personal attributes from public activities on pseudonymous platforms with high effectiveness.

SafeTrans: LLM-assisted Transpilation from C to Rust

cs.CR · 2025-05-15 · accept · novelty 6.0

SafeTrans achieves up to 80% successful C-to-Rust translations via LLM iterative repair on 2653 programs and two real projects, with some C vulnerabilities carrying over to the Rust output.

KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification

cs.CL · 2025-05-08 · unverdicted · novelty 6.0

KG-HTC integrates knowledge graphs into LLMs via RAG to improve zero-shot hierarchical text classification performance on WoS, DBpedia, and Amazon datasets.

Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement

cs.LG · 2024-09-04 · unverdicted · novelty 6.0

ERFSL uses LLMs to create per-requirement reward components, correct their code via a critic, and optimize weights with genetic-algorithm-style mutation and crossover driven by training logs, succeeding in a zero-shot data collection task.

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

cs.CL · 2026-06-05 · unverdicted · novelty 5.0

EASE-TTT creates a soft attention target from evidence chunks to guide query-side test-time adaptation, yielding higher macro-average scores than full-context, retrieval-only, and standard qTTT baselines on six LongBench QA tasks.

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

Memory-R2 proposes LoGo-GRPO to fix unfair trajectory comparisons in RL training of memory-augmented LLM agents by combining global end-to-end rewards with local rerollouts from identical memory states.

LLMs with in-context learning for Algorithmic Theoretical Physics

cs.LG · 2026-05-06 · unverdicted · novelty 5.0

Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.

POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

cs.CL · 2025-10-17 · unverdicted · novelty 5.0

POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context length by up to 10x on benchmarks.

Retrieval-Augmented Generation with Graphs (GraphRAG)

cs.IR · 2024-12-31 · unverdicted · novelty 5.0

A survey proposing a holistic GraphRAG framework with components including query processor, retriever, organizer, generator, and data source, plus domain-tailored reviews, challenges, and future directions.

MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers

cs.CL · 2026-06-29 · unverdicted · novelty 4.0

MATCH augments sparsified attention with an efficient in-context retrieval system to boost performance on long-range recall tasks in transformers.

DD-GEPA: Prompt Optimization for Dialogue Disentanglement Focusing on Task Instruction and Utterance Representation

cs.SE · 2026-06-05 · unverdicted · novelty 4.0

DD-GEPA decomposes and optimizes prompts with GEPA for LLM-based dialogue disentanglement, reporting accuracy gains over baseline and hand-crafted prompts on benchmarks.

Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

cs.AI · 2026-06-03 · unverdicted · novelty 4.0

Headache specialists preferred their own literature summaries over those from Sonnet, GPT-4o, and Llama 3.1 in a blinded evaluation, though AI summaries were sometimes indistinguishable.

citing papers explorer

Showing 26 of 26 citing papers after filters.

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

cs.AI · 2026-05-27 · unverdicted · none · ref 21

Reasoning models naturally compress context via thinking traces, with reward-constrained optimization yielding 17-23% gains over baselines on long-context QA at high compression ratios.

Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

cs.LG · 2026-05-25 · unverdicted · none · ref 25

Step-TP is a dataset providing grounded, atomic step-level IR transitions and CoT supervision to enable reliable multi-step LLM-guided tensor program optimization instead of end-to-end imitation.

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

cs.SE · 2025-05-27 · unverdicted · none · ref 19

Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.

MicroAgent: Context-Augmented Multi-Agent Framework for Automatic Microservice Decomposition

cs.SE · 2026-06-29 · unverdicted · none · ref 35

MicroAgent framework assigns five subtasks to specialized agents with multi-granularity context and analytical tools, achieving 89.2% average accuracy on 10 Java applications and beating prior methods by 24.6%.

Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling

cs.CL · 2026-06-26 · unverdicted · none · ref 10

LPES uses per-layer scaling factors optimized by a genetic algorithm with Bézier curves to balance attention and improve long-context LLM performance by up to 11.2% on key-value retrieval.

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

cs.AI · 2026-05-28 · unverdicted · none · ref 45

Mindgames introduces a four-game evaluation platform for multi-agent LLM reasoning, runs a 944-agent competition, surfaces rule-adherence and error-survival limitations, and releases a 29k-game dataset with an offline scoring protocol.

ERFSL: An Efficient Reward Function Searcher via Language Models for Custom-Environment Multi-Objective Optimization (Student Abstract)

eess.SY · 2026-05-19 · unverdicted · none · ref 20

ERFSL generates and optimizes LLM-based reward functions for custom multi-objective RL, correcting codes in one iteration and converging weights in 5.2 iterations on average even from 500x errors.

MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

cs.CV · 2026-05-12 · unverdicted · none · ref 4

MMCL-Bench shows that even the strongest frontier multimodal models solve fewer than one-third of tasks requiring recovery and application of visual rules, procedures, and empirical patterns.

Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation

cs.CR · 2026-04-24 · unverdicted · none · ref 39

Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.

GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution

cs.AR · 2026-04-24 · unverdicted · none · ref 23

GR-Evolve applies LLM-driven code evolution to global routing, reporting up to 8.72% post-detailed-routing wirelength reduction on seven benchmarks across three technology nodes.

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

cs.IR · 2026-04-20 · unverdicted · none · ref 16

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.

Learning to Adapt: In-Context Learning Beyond Stationarity

cs.LG · 2026-04-13 · unverdicted · none · ref 27

Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.

Automated Profile Inference with Language Model Agents

cs.CR · 2025-05-18 · unverdicted · none · ref 2

LLM agents can automatically infer identifiable and sensitive personal attributes from public activities on pseudonymous platforms with high effectiveness.

KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification

cs.CL · 2025-05-08 · unverdicted · none · ref 18

KG-HTC integrates knowledge graphs into LLMs via RAG to improve zero-shot hierarchical text classification performance on WoS, DBpedia, and Amazon datasets.

Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement

cs.LG · 2024-09-04 · unverdicted · none · ref 16

ERFSL uses LLMs to create per-requirement reward components, correct their code via a critic, and optimize weights with genetic-algorithm-style mutation and crossover driven by training logs, succeeding in a zero-shot data collection task.

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

cs.CL · 2026-06-05 · unverdicted · none · ref 62

EASE-TTT creates a soft attention target from evidence chunks to guide query-side test-time adaptation, yielding higher macro-average scores than full-context, retrieval-only, and standard qTTT baselines on six LongBench QA tasks.

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

cs.LG · 2026-05-20 · unverdicted · none · ref 8

Memory-R2 proposes LoGo-GRPO to fix unfair trajectory comparisons in RL training of memory-augmented LLM agents by combining global end-to-end rewards with local rerollouts from identical memory states.

LLMs with in-context learning for Algorithmic Theoretical Physics

cs.LG · 2026-05-06 · unverdicted · none · ref 20

Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.

POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

cs.CL · 2025-10-17 · unverdicted · none · ref 26

POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context length by up to 10x on benchmarks.

Retrieval-Augmented Generation with Graphs (GraphRAG)

cs.IR · 2024-12-31 · unverdicted · none · ref 236

A survey proposing a holistic GraphRAG framework with components including query processor, retriever, organizer, generator, and data source, plus domain-tailored reviews, challenges, and future directions.

MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers

cs.CL · 2026-06-29 · unverdicted · none · ref 95

MATCH augments sparsified attention with an efficient in-context retrieval system to boost performance on long-range recall tasks in transformers.

DD-GEPA: Prompt Optimization for Dialogue Disentanglement Focusing on Task Instruction and Utterance Representation

cs.SE · 2026-06-05 · unverdicted · none · ref 127

DD-GEPA decomposes and optimizes prompts with GEPA for LLM-based dialogue disentanglement, reporting accuracy gains over baseline and hand-crafted prompts on benchmarks.

Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

cs.AI · 2026-06-03 · unverdicted · none · ref 133

Headache specialists preferred their own literature summaries over those from Sonnet, GPT-4o, and Llama 3.1 in a blinded evaluation, though AI summaries were sometimes indistinguishable.

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

cs.AI · 2025-10-27 · unverdicted · none · ref 250

A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.

Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy

cs.AI · 2025-04-18 · unverdicted · none · ref 29

A parallel compliance architecture using multi-stage LLM retrieval improves correctness and reasoning quality over a baseline for OT cybersecurity compliance queries in a railway case study.

Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects

cs.CL · 2026-04-01 · unverdicted · none · ref 22

A RAG-based virtual assistant was developed and evaluated to deliver accurate, context-specific responses for students navigating university project regulations.
