DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

Wei He , Kai Liu , Jing Liu , Yajuan Lyu , Shiqi Zhao , Xinyan Xiao , Yuan Liu , Yizhong Wang

show 5 more authors

Hua Wu Qiaoqiao She Xuan Liu Tian Wu HaiFeng Wang

Authors on Pith no claims yet

classification 💻 cs.CL

keywords dureaderchinesecommunitydatasetimprovementsquestionsanswersbaidu

0 comments

read the original abstract

This paper introduces DuReader, a new large-scale, open-domain Chinese ma- chine reading comprehension (MRC) dataset, designed to address real-world MRC. DuReader has three advantages over previous MRC datasets: (1) data sources: questions and documents are based on Baidu Search and Baidu Zhidao; answers are manually generated. (2) question types: it provides rich annotations for more question types, especially yes-no and opinion questions, that leaves more opportunity for the research community. (3) scale: it contains 200K questions, 420K answers and 1M documents; it is the largest Chinese MRC dataset so far. Experiments show that human performance is well above current state-of-the-art baseline systems, leaving plenty of room for the community to make improvements. To help the community make these improvements, both DuReader and baseline systems have been posted online. We also organize a shared competition to encourage the exploration of more models. Since the release of the task, there are significant improvements over the baselines.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
cs.CL 2024-02 unverdicted novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.