hub Tool reference

arXiv preprint arXiv:1704.04683, 2017

RACE: Large-scale reading comprehension dataset from examinations · 2017 · cs.CL · arXiv 1704.04683

Tool reference. 83% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

25 Pith papers citing it

Method reference 83% of classified citations

abstract

We present RACE, a new dataset for benchmark evaluation of methods in the reading comprehension task. Collected from the English exams for middle and high school Chinese students in the age range between 12 to 18, RACE consists of near 28,000 passages and near 100,000 questions generated by human experts (English instructors), and covers a variety of topics which are carefully designed for evaluating the students' ability in understanding and reasoning. In particular, the proportion of questions that requires reasoning is much larger in RACE than that in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of the state-of-the-art models (43%) and the ceiling human performance (95%). We hope this new dataset can serve as a valuable resource for research and evaluation in machine comprehension. The dataset is freely available at http://www.cs.cmu.edu/~glai1/data/race/ and the code is available at https://github.com/qizhex/RACE_AR_baselines.

citation-role summary

dataset: 5 · background: 1

citation-polarity summary

use dataset: 5 · background: 1

representative citing papers

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

cs.CL · 2017-05-09 · accept · novelty 8.0

TriviaQA is a new large-scale dataset for reading comprehension that features complex compositional questions, high lexical variability, and cross-sentence reasoning requirements, where current baselines reach only 40% while humans reach 80%.

PRIMETIME: Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Multitask Prompted Training Enables Zero-Shot Task Generalization

cs.LG · 2021-10-15 · conditional · novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

PubMedQA: A Dataset for Biomedical Research Question Answering

cs.CL · 2019-09-13 · unverdicted · novelty 7.0

PubMedQA supplies 273k+ biomedical QA instances that require reasoning over research abstracts to produce yes/no/maybe answers.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

cs.CL · 2019-06-19 · accept · novelty 7.0

XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

cs.CL · 2019-09-17 · unverdicted · novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

Short window attention enables long-term memorization

cs.LG · 2025-09-29 · unverdicted · novelty 6.0

Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

cs.CL · 2024-11-15 · conditional · novelty 6.0

Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

An Empirical Study of Mamba-based Language Models

cs.LG · 2024-06-12 · accept · novelty 6.0

An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.

Chain-of-Verification Reduces Hallucination in Large Language Models

cs.CL · 2023-09-20 · unverdicted · novelty 6.0

Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

Dream 7B: Diffusion Large Language Models

cs.CL · 2025-08-21 · unverdicted · novelty 6.0

Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and quality-speed tradeoffs.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

cs.CV · 2025-04-14 · conditional · novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

cs.LG · 2023-09-25 · accept · novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.

The Efficiency Gap in Byte Modeling

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

cs.LG · 2026-04-19 · unverdicted · novelty 5.0

ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

cs.CL · 2019-07-26 · accept · novelty 5.0

With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.

citing papers explorer

Showing 25 of 25 citing papers.

Language Models are Few-Shot Learners cs.CL · 2020-05-28 · accept · none · ref 47
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension cs.CL · 2017-05-09 · accept · none · ref 16
TriviaQA is a new large-scale dataset for reading comprehension that features complex compositional questions, high lexical variability, and cross-sentence reasoning requirements, where current baselines reach only 40% while humans reach 80%.
PRIMETIME: Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 67 · internal anchor
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Multitask Prompted Training Enables Zero-Shot Task Generalization cs.LG · 2021-10-15 · conditional · none · ref 25 · internal anchor
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
PubMedQA: A Dataset for Biomedical Research Question Answering cs.CL · 2019-09-13 · unverdicted · none · ref 34 · internal anchor
PubMedQA supplies 273k+ biomedical QA instances that require reasoning over research abstracts to produce yes/no/maybe answers.
XLNet: Generalized Autoregressive Pretraining for Language Understanding cs.CL · 2019-06-19 · accept · none · ref 18 · internal anchor
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 148
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism cs.CL · 2019-09-17 · unverdicted · none · ref 15
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
Short window attention enables long-term memorization cs.LG · 2025-09-29 · unverdicted · none · ref 22 · internal anchor
Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 42 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
An Empirical Study of Mamba-based Language Models cs.LG · 2024-06-12 · accept · none · ref 29 · internal anchor
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
Chain-of-Verification Reduces Hallucination in Large Language Models cs.CL · 2023-09-20 · unverdicted · none · ref 86 · internal anchor
Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable cs.AI · 2026-05-08 · unverdicted · none · ref 25
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask cs.LG · 2026-05-07 · unverdicted · none · ref 19
SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling cs.CL · 2026-04-27 · unverdicted · none · ref 24
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 57
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
Dream 7B: Diffusion Large Language Models cs.CL · 2025-08-21 · unverdicted · none · ref 15
Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and quality-speed tradeoffs.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 59
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 118
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models cs.LG · 2023-09-25 · accept · none · ref 151
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering cs.CL · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
The Efficiency Gap in Byte Modeling cs.LG · 2026-05-13 · unverdicted · none · ref 37 · internal anchor
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods cs.LG · 2026-04-19 · unverdicted · none · ref 22
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
RoBERTa: A Robustly Optimized BERT Pretraining Approach cs.CL · 2019-07-26 · accept · none · ref 22
With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.
Machine Reading Comprehension: a Literature Review cs.CL · 2019-06-30 · unverdicted · none · ref 25 · internal anchor
A 2019 survey of machine reading comprehension corpora and methods.
