pith. the verified trust layer for science.

arxiv: 2506.01062 · v4 · submitted 2025-06-01 · cs.CL · cs.AI · cs.LG

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

classification cs.CL · cs.AI · cs.LG
keywords models · sealqa · reasoning · accuracy · seal-0 · achieve · across · even

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.

This paper has not been read by Pith yet.


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI

    cs.CL 2026-04 unverdicted novelty 6.0

    APEX-MEM uses property graphs with temporal events, append-only storage, and an agentic retrieval system to reach 88.88% accuracy on LOCOMO QA and 86.2% on LongMemEval, outperforming prior session-aware methods.

  2. Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    A fine-tuning policy trains small language models to search reliably and use evidence, improving multi-hop QA performance by 15-17 points to reach large-model levels.

  3. ExpSeek: Self-Triggered Experience Seeking for Web Agents

    cs.CL 2026-01 unverdicted novelty 6.0

    ExpSeek shifts web agents to self-triggered step-level experience seeking via entropy thresholds, delivering 9.3% and 7.5% absolute gains on Qwen3-8B and 32B models across four benchmarks.

  4. MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

    cs.CL 2025-11 unverdicted novelty 6.0

    MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.

  5. Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    Tool-augmented LLM reasoning incurs a protocol-induced performance tax that can exceed tool benefits under semantic noise, partially mitigated by a lightweight gate called G-STEP.

  6. Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

  7. EvoSkill: Automated Skill Discovery for Multi-Agent Systems

    cs.AI 2026-03 unverdicted novelty 5.0

    EvoSkill evolves agent skills via failure analysis and Pareto frontier selection, raising exact-match accuracy by 7.3% on OfficeQA and by 12.1% on SealQA, with 5.3% zero-shot transfer to BrowseComp.

  8. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency by up to 4.5 times.