pith. machine review for the scientific record.

arxiv: 1705.03551 · v2 · submitted 2017-05-09 · 💻 cs.CL

Recognition: no theorem link

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 14:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords TriviaQA · reading comprehension · distant supervision · question answering · trivia questions · evidence documents · neural network baseline · feature-based classifier

The pith

TriviaQA introduces a distantly supervised dataset of 95,000 trivia questions paired with independently gathered evidence documents, on which current models reach only 40 percent accuracy against 80 percent for humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TriviaQA, a reading-comprehension resource built from trivia-enthusiast questions paired with independently collected evidence documents. It establishes that the questions are more compositional, exhibit greater lexical and syntactic distance from their answers, and require more cross-sentence reasoning than those in prior large-scale benchmarks. Two standard baselines, a feature-based classifier and a neural network that performs well on SQuAD, are shown to fall well short of human performance. The resulting gap positions the dataset as a demanding testbed intended to drive progress on harder forms of comprehension.

Core claim

We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study.

What carries the argument

The TriviaQA collection of question-answer pairs, each paired with six independently gathered evidence documents on average, which supplies distant supervision while the trivia-authored questions demand compositional, cross-sentence reasoning.
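To make the shape of this resource concrete, here is a minimal sketch of how such question-answer-evidence triples could be assembled; the field names, the lookup interface, and the naive case-insensitive string match are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    question: str
    answer: str
    evidence: str       # one independently gathered document
    answer_found: bool  # the distant-supervision signal

def build_triples(qa_pairs, evidence_lookup):
    """Pair each question-answer item with its evidence documents.

    qa_pairs: iterable of (question, answer) tuples.
    evidence_lookup: maps a question to its ~6 retrieved documents.
    Both interfaces are hypothetical, not the released data format.
    """
    triples = []
    for question, answer in qa_pairs:
        for doc in evidence_lookup[question]:
            # Distant supervision: treat the document as supporting
            # evidence if the answer string occurs anywhere in it.
            found = answer.lower() in doc.lower()
            triples.append(Triple(question, answer, doc, found))
    return triples
```

The noise in this signal (a matched string need not actually support the answer) is exactly what the "high quality distant supervision" premise below has to survive.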

If this is right

  • Systems must improve compositional and cross-sentence reasoning to reach high accuracy on TriviaQA.
  • Distant supervision from multiple evidence documents can serve as a scalable training signal for reading-comprehension models.
  • Lexical and syntactic variability between questions and evidence must be explicitly modeled to close the performance gap.
  • Future benchmarks should incorporate similar independently sourced evidence to maintain difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset's construction method could be replicated for other domains to create distant-supervision resources without manual annotation.
  • Strong performance on TriviaQA may transfer to downstream tasks that require integrating scattered facts, such as multi-hop question answering.
  • The observed gap invites investigation of whether hybrid feature-neural architectures can narrow it faster than either approach alone.

Load-bearing premise

The independently gathered evidence documents supply high-quality distant supervision that is sufficient to answer the questions.

What would settle it

A single model that reaches near 80 percent accuracy on the held-out TriviaQA test set while using only the provided evidence documents would falsify the claim that the dataset remains a significant challenge.
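That test is easy to state operationally. A sketch follows; the answer normalization (lowercasing, stripping punctuation and English articles) follows common QA-evaluation practice and is an assumption here, not TriviaQA's official scorer.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_accuracy(predictions, gold_aliases):
    """Fraction of predictions matching any accepted gold alias."""
    hits = sum(
        any(normalize(pred) == normalize(gold) for gold in aliases)
        for pred, aliases in zip(predictions, gold_aliases)
    )
    return hits / len(predictions)

# The bar implied above: exact_match_accuracy(...) approaching 0.80,
# the reported human level, using only the provided evidence documents.
```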

read the original abstract

We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study. Data and code available at -- http://nlp.cs.washington.edu/triviaqa/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TriviaQA, a large-scale reading comprehension dataset with 95K trivia questions and over 650K question-answer-evidence triples. Evidence documents (six per question on average) are gathered independently via web search to provide distant supervision. The authors claim the dataset features more complex, compositional questions than prior resources like SQuAD, with greater syntactic/lexical variability between questions and answer sentences plus a higher requirement for cross-sentence reasoning. Two baselines are evaluated—a feature-based classifier and a neural model adapted from SQuAD—achieving 23% and 40% respectively against 80% human performance, positioning TriviaQA as a challenging benchmark.

Significance. If the distant supervision quality holds and the variability/reasoning claims are substantiated, TriviaQA would be a significant addition to RC benchmarks by emphasizing multi-document settings, noisy evidence, and compositional reasoning. The public release of data and code, plus the clear performance gap, would usefully drive future model development beyond single-paragraph SQuAD-style tasks.

major comments (3)
  1. [Dataset Construction] Dataset Construction section: the central claim that TriviaQA provides 'high quality distant supervision' and is challenging due to reasoning demands (rather than label noise) requires explicit validation that answer spans are present in the independently gathered evidence documents. The manuscript should report the fraction of questions for which the answer appears in at least one of the six documents (via exact string match or normalized matching) and describe any manual verification on a sample; without this, the 23%/40% baseline scores cannot be confidently attributed to syntactic variability or cross-sentence reasoning.
  2. [Analysis] Analysis section (comparison to SQuAD and other datasets): the claims of 'considerable syntactic and lexical variability' and 'more cross sentence reasoning' are load-bearing for the 'challenging testbed' conclusion, yet the abstract and provided details give no concrete quantification method (e.g., no mention of dependency-parse distance, n-gram overlap statistics, or manual annotation protocol for reasoning hops). These metrics must be defined and reported with inter-annotator agreement if manual.
  3. [Baselines] Baselines section: the neural baseline is described as 'state-of-the-art' on SQuAD, but implementation details are needed on how the multi-document evidence is handled (e.g., concatenation strategy, truncation, or per-document scoring) since this directly affects whether the 40% result reflects the dataset's claimed difficulty or an incomplete adaptation.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'six per question on average' should be accompanied by the exact mean and standard deviation of evidence documents per question for precision.
  2. [Dataset] The paper should include a small table or paragraph in the Dataset section reporting basic statistics on question length, answer type distribution, and evidence document lengths to aid reproducibility.
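Both of the first two major comments ask for quantities that are cheap to compute once the data is in hand. Below is a sketch of the requested answer-coverage fraction (major comment 1) and a simple lexical-overlap statistic (major comment 2); the record fields are assumed stand-ins, not the release schema, and the normalization matches the earlier evaluation sketch.

```python
import re
import string

def normalize(text: str) -> str:
    """Same normalization as the earlier evaluation sketch."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(re.sub(r"\b(a|an|the)\b", " ", text).split())

def answer_coverage(records):
    """Fraction of questions whose answer appears, after normalization,
    in at least one gathered evidence document (major comment 1).

    Each record is assumed to carry .aliases (accepted answer strings)
    and .documents (evidence texts); both names are hypothetical.
    """
    covered = sum(
        any(
            normalize(alias) in normalize(doc)
            for alias in record.aliases
            for doc in record.documents
        )
        for record in records
    )
    return covered / len(records)

def lexical_overlap(question: str, evidence_sentence: str) -> float:
    """Share of question tokens that recur in the answer-bearing sentence;
    consistently low values would support the 'lexical variability'
    claim questioned in major comment 2."""
    q_tokens = set(normalize(question).split())
    s_tokens = set(normalize(evidence_sentence).split())
    return len(q_tokens & s_tokens) / max(len(q_tokens), 1)
```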

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and will revise the paper to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset Construction section: the central claim that TriviaQA provides 'high quality distant supervision' and is challenging due to reasoning demands (rather than label noise) requires explicit validation that answer spans are present in the independently gathered evidence documents. The manuscript should report the fraction of questions for which the answer appears in at least one of the six documents (via exact string match or normalized matching) and describe any manual verification on a sample; without this, the 23%/40% baseline scores cannot be confidently attributed to syntactic variability or cross-sentence reasoning.

    Authors: We agree that providing explicit statistics on answer span presence is important for validating the quality of distant supervision. Although the manuscript emphasizes the independent gathering of evidence via web search, we will add to the Dataset Construction section the requested fraction of questions where the answer appears in at least one document, computed via both exact string match and normalized matching. We will also describe the manual verification performed on a random sample of questions to confirm the presence and relevance of answers. This addition will strengthen the claim of high-quality supervision. revision: yes

  2. Referee: [Analysis] Analysis section (comparison to SQuAD and other datasets): the claims of 'considerable syntactic and lexical variability' and 'more cross sentence reasoning' are load-bearing for the 'challenging testbed' conclusion, yet the abstract and provided details give no concrete quantification method (e.g., no mention of dependency-parse distance, n-gram overlap statistics, or manual annotation protocol for reasoning hops). These metrics must be defined and reported with inter-annotator agreement if manual.

    Authors: The Analysis section provides qualitative and some quantitative comparisons to SQuAD, but we acknowledge that more explicit quantification methods are needed to support the claims. In the revised manuscript, we will define and report concrete metrics such as average n-gram overlap between questions and answer sentences, syntactic variability measured via dependency parse distances, and the proportion of questions requiring cross-sentence reasoning based on a manually annotated sample with reported inter-annotator agreement. This will make the claims more rigorous. revision: yes

  3. Referee: [Baselines] Baselines section: the neural baseline is described as 'state-of-the-art' on SQuAD, but implementation details are needed on how the multi-document evidence is handled (e.g., concatenation strategy, truncation, or per-document scoring) since this directly affects whether the 40% result reflects the dataset's claimed difficulty or an incomplete adaptation.

    Authors: We appreciate this point as it clarifies how the baseline was adapted to the multi-document setting. The neural model processes the evidence by concatenating the top-ranked documents up to the maximum sequence length, applying truncation where necessary, and selecting the highest scoring answer span across documents. We will include these specific implementation details in the Baselines section of the revised version to allow full reproducibility and proper interpretation of the results. revision: yes
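For concreteness, a sketch of the multi-document handling described in response 3: rank documents, concatenate the top-ranked ones up to a length budget with truncation, and keep the highest-scoring span. The rank_score function, the token budget, and the model.predict interface are assumptions, not the authors' released code.

```python
def prepare_context(documents, rank_score, max_tokens=800):
    """Concatenate documents in descending rank order, truncating once
    the token budget (a stand-in for the model's sequence limit) is spent."""
    tokens = []
    for doc in sorted(documents, key=rank_score, reverse=True):
        remaining = max_tokens - len(tokens)
        if remaining <= 0:
            break
        tokens.extend(doc.split()[:remaining])
    return " ".join(tokens)

def best_span(model, question, documents, rank_score):
    """Read the concatenated evidence and return the single
    highest-scoring answer span; model.predict is assumed to
    return a (span, score) pair."""
    context = prepare_context(documents, rank_score)
    span, _score = model.predict(question, context)
    return span
```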

Circularity Check

0 steps flagged

No circularity in dataset construction or baseline evaluation

full rationale

The paper introduces TriviaQA through the collection of 95K trivia questions and independent evidence documents (six per question on average), followed by direct comparison to prior datasets and runs of two standard baselines (a feature-based classifier and a neural network) that achieve 23% and 40% versus 80% human performance. No equations, parameter fittings presented as predictions, self-citations that bear central claims, or ansatzes are involved. The claim that TriviaQA is a challenging testbed follows from the reported performance gap on the collected data without any reduction to self-defined quantities or imported uniqueness theorems. The work is self-contained as a data release plus empirical baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset paper; the central claim rests on new data collection and standard evaluation protocols rather than new axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5478 in / 975 out tokens · 45416 ms · 2026-05-11T14:54:48.532364+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 53 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Online Learning-to-Defer with Varying Experts

    stat.ML 2026-05 unverdicted novelty 8.0

    Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

  2. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  3. Passage Re-ranking with BERT

    cs.IR 2019-01 unverdicted novelty 8.0

    Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.

  4. Task-Aware Calibration: Provably Optimal Decoding in LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

  5. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  6. XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

  7. HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    HaS accelerates RAG retrieval via homology-aware speculative retrieval and homologous query re-identification validation, cutting latency 24-37% with 1-2% accuracy drop on tested datasets.

  8. Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

  9. LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.

  10. PolyReal: A Benchmark for Real-World Polymer Science Workflows

    cs.CV 2026-04 unverdicted novelty 7.0

    PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.

  11. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

  12. Path-Constrained Mixture-of-Experts

    cs.LG 2026-03 unverdicted novelty 7.0

    PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.

  13. The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

    eess.AS 2026-03 unverdicted novelty 7.0

    FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.

  14. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  15. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  16. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    cs.CL 2024-02 unverdicted novelty 7.0

    M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...

  17. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  18. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  19. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  20. APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.

  21. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  22. Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.

  23. T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.

  24. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  25. PRAG: End-to-End Privacy-Preserving Retrieval-Augmented Generation

    cs.CR 2026-04 unverdicted novelty 6.0

    PRAG delivers end-to-end private RAG with 72-74% recall via non-interactive homomorphic approximations, interactive client assistance, and operation-error estimation to preserve ranking quality.

  26. Mixture of Heterogeneous Grouped Experts for Language Modeling

    cs.CL 2026-04 unverdicted novelty 6.0

    MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.

  27. How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

  28. SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

    cs.NI 2026-04 unverdicted novelty 6.0

    SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...

  29. Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

    cs.CL 2026-04 unverdicted novelty 6.0

    POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...

  30. Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.

  31. Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

    cs.AI 2026-04 unverdicted novelty 6.0

    Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.

  32. BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.

  33. Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

    cs.CL 2026-04 conditional novelty 6.0

    Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...

  34. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    cs.LG 2026-04 unverdicted novelty 6.0

    Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.

  35. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  36. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  37. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  38. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  39. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  40. Measuring short-form factuality in large language models

    cs.CL 2024-11 unverdicted novelty 6.0

    SimpleQA is a new benchmark of short, single-answer factual questions collected adversarially against GPT-4 to evaluate LLM factuality and confidence calibration.

  41. Towards Understanding Sycophancy in Language Models

    cs.CL 2023-10 conditional novelty 6.0

    Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.

  42. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  43. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

    cs.CL 2023-02 unverdicted novelty 6.0

    Semantic entropy improves uncertainty estimation in natural language generation by incorporating semantic equivalences, outperforming standard entropy baselines on predicting model accuracy for question answering.

  44. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  45. How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    cs.CL 2020-02 accept novelty 6.0

    Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.

  46. FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FreezeEmpath achieves emotionally expressive speech output and strong performance on empathetic dialogue, speech emotion recognition, and spoken QA tasks by training with a frozen LLM on existing speech datasets.

  47. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  48. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  49. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    cs.CL 2024-01 unverdicted novelty 5.0

    DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.

  50. Ministral 3

    cs.CL 2026-01 unverdicted novelty 4.0

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

  51. Gemma: Open Models Based on Gemini Research and Technology

    cs.CL 2024-03 accept novelty 4.0

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  52. A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 3.0

    MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

  53. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 53 Pith papers · 1 internal anchor

  1. [1]

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP 2013, Seattle, Washington, USA. http://aclweb.org/anthology/D/D13/D13-1160.pdf

  2. [2]

    Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. CoRR abs/1506.02075. https://arxiv.org/abs/1506.02075

  3. [3]

    Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of EMNLP-CoNLL 2012. Association for Computational Linguistics. http://www.aclweb.org/anthology/D12-1118

  4. [4]

    Qingqing Cai and Alexander Yates. 2013. Large-scale semantic parsing via schema matching and lexicon extension. In Proceedings of ACL 2013 (Volume 1: Long Papers), Sofia, Bulgaria, pages 423-433. http://www.aclweb.org/anthology/P13-1042

  5. [5]

    Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of ACL 2016 (Volume 1: Long Papers), Berlin, Germany. http://www.aclweb.org/anthology/P16-1223

  6. [6]

    Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR abs/1704.05179. https://arxiv.org/abs/1704.05179

  7. [7]

    Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of KDD '14, pages 1156-1165. https://doi.org/10.1145/2623330.2623677

  8. [8]

    Paolo Ferragina and Ugo Scaiella. 2010. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of CIKM '10, pages 1625-1628. https://doi.org/10.1145/1871437.1871689

  9. [9]

    David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine 31(3):59-79.

  10. [10]

    He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In Proceedings of ICML 2016, PMLR volume 48, New York, New York, USA. http://proceedings.mlr.press/v48/he16.html

  11. [11]

    Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems. http://arxiv.org/abs/1506.03340

  12. [12]

    Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. CoRR abs/1511.02301. https://arxiv.org/abs/1511.02301

  13. [13]

    Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL-HLT 2011. http://www.aclweb.org/anthology/P11-1055

  14. [14]

    Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of EMNLP 2014, Doha, Qatar. http://www.aclweb.org/anthology/D14-1070

  15. [15]

    Mandar Joshi, Uma Sawant, and Soumen Chakrabarti. 2014. Knowledge graph and corpus driven segmentation and answer inference for telegraphic entity-seeking queries. In Proceedings of EMNLP 2014, Doha, Qatar. http://www.aclweb.org/anthology/D14-1117

  16. [16]

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. CoRR abs/1704.04683. https://arxiv.org/abs/1704.04683

  17. [17]

    Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. 2016. Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. CoRR abs/1607.06275. https://arxiv.org/abs/1607.06275

  18. [18]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Workshop in Advances in Neural Information Processing Systems. https://arxiv.org/pdf/1611.09268.pdf

  19. [19]

    Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of EMNLP 2016, Austin, Texas, pages 2230-2235. https://aclweb.org/anthology/D16-1241

  20. [20]

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of ACL 2016. http://www.aclweb.org/anthology/P16-1144

  21. [21]

    Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of ACL-IJCNLP 2015. http://aclweb.org/anthology/P/P15/P15-1142.pdf

  22. [22]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP 2016, Austin, Texas, pages 2383-2392. https://aclweb.org/anthology/D16-1264

  23. [23]

    Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP 2013, Seattle, Washington, USA. http://www.aclweb.org/anthology/D13-1020

  24. [24]

    Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML PKDD '10, pages 148-163. http://dl.acm.org/citation.cfm?id=1889788.1889799

  25. [25]

    Uma Sawant and Soumen Chakrabarti. 2013. Learning joint query interpretation and response ranking. In Proceedings of WWW '13, pages 1099-1110. https://doi.org/10.1145/2488388.2488484

  26. [26]

    Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR 2017. https://arxiv.org/abs/1611.01603

  27. [27]

    Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. CoRR abs/1611.09830. https://arxiv.org/abs/1611.09830

  28. [28]

    Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of SIGIR '00, pages 200-207. https://doi.org/10.1145/345508.345577

  29. [29]

    Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In Proceedings of ACL-IJCNLP 2015 (Volume 2: Short Papers). http://www.aclweb.org/anthology/P15-2115

  30. [30]

    Qiang Wu, Christopher J. Burges, Krysta M. Svore, and Jianfeng Gao. 2010. Adapting boosting for information retrieval measures. Information Retrieval 13(3):254-270. https://doi.org/10.1007/s10791-009-9112-1

  31. [31]

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. https://arxiv.org/abs/1502.03044

  32. [32]

    Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of EMNLP 2015, Lisbon, Portugal, pages 2013-2018. http://aclweb.org/anthology/D15-1237

  33. [33]

    Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT 2016. http://www.aclweb.org/anthology/N16-1174