pith. machine review for the scientific record.

arxiv: 2605.11663 · v1 · submitted 2026-05-12 · 💻 cs.CL

Recognition: 2 Lean theorem links

Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal benchmark · educational AI evaluation · Japanese national assessment · student response distributions · MLLM testing · authentic exam data · visual reasoning · K-12 assessment

The pith

A dataset built from real Japanese middle-school exams and nearly 900,000 student answers creates a benchmark that lets multimodal AI models be scored directly against human performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multimodal dataset drawn from Japan's National Assessment of Academic Ability for middle schools. It keeps the original exam layouts, diagrams, and Japanese text in science, mathematics, and language subjects, then adds nationwide aggregated response data from about 900,000 students. The researchers run recent multimodal large language models on the items, scoring with exact-match accuracy and character-level F1 for open-ended responses, and find clear differences across subjects and added difficulty when visual reasoning is required. Human checks and LLM-as-judge comparisons test how well automatic scoring works. The result is a reusable, human-grounded resource for evaluating AI in authentic educational settings.
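
For concreteness, a minimal sketch of the two reported metrics. The paper does not publish its scoring code, so treating the character-level F1 as a bag-of-characters overlap and stripping only whitespace are assumptions here, not the authors' implementation.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the answers match exactly after whitespace stripping, else 0.0."""
    return float(prediction.strip() == reference.strip())

def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1: harmonic mean of precision and recall over
    character multisets, a common convention for Japanese text."""
    pred_chars = Counter(prediction.strip())
    ref_chars = Counter(reference.strip())
    overlap = sum((pred_chars & ref_chars).values())  # shared chars, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_chars.values())
    recall = overlap / sum(ref_chars.values())
    return 2 * precision * recall / (precision + recall)

# A paraphrased open-ended answer fails exact match but keeps most F1 credit:
print(exact_match("光合成で酸素が発生する", "酸素は光合成で発生する"))            # 0.0
print(round(char_f1("光合成で酸素が発生する", "酸素は光合成で発生する"), 2))  # ~0.91
```

The gap between the two numbers is exactly why the paper needs both metrics: exact match penalizes any rewording, while character F1 rewards semantically overlapping answers.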

Core claim

The authors construct a multimodal dataset from officially released middle-school exam items in Science, Mathematics, and Japanese Language that preserves real layouts, diagrams, and educational text together with nationwide aggregated student response distributions from roughly 900,000 students. Benchmarking of multimodal LLMs on exact-match accuracy and character-level F1 reveals substantial performance variation across subjects and strong sensitivity to visual reasoning demands, while human evaluation and LLM-as-judge analyses assess the reliability of automatic scoring.

What carries the argument

The multimodal benchmark dataset itself, built from authentic exam items and student response distributions, which supplies both the questions and the human performance baselines for unified model evaluation.

If this is right

  • Model outputs can be compared directly to real nationwide student performance distributions rather than synthetic or expert-curated labels.
  • Performance gaps become visible between subjects and between text-only versus diagram-heavy questions.
  • Automatic scoring methods can be checked for reliability through side-by-side human judgments on open-ended answers; a sketch of one such agreement check follows this list.
  • The same items support research on feedback generation and explainable AI tailored to authentic assessment.
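
Picking up the scoring-reliability bullet above: one concrete way to quantify it is an agreement statistic between human graders and the automatic scorer. The paper does not name its statistic, so Cohen's kappa below is an assumption, and the grades are hypothetical.

```python
def cohens_kappa(human: list[int], auto: list[int]) -> float:
    """Chance-corrected agreement between two binary raters."""
    n = len(human)
    observed = sum(h == a for h, a in zip(human, auto)) / n
    # Expected agreement if the raters labeled independently
    # with their observed marginal rates.
    p_h, p_a = sum(human) / n, sum(auto) / n
    expected = p_h * p_a + (1 - p_h) * (1 - p_a)
    if expected == 1.0:  # degenerate case: both raters constant
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical correct/incorrect grades for ten open-ended answers:
human_grades = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
llm_judge    = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(f"kappa = {cohens_kappa(human_grades, llm_judge):.2f}")  # 0.58
```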

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could reveal whether models reproduce the specific error patterns students show on Japanese curriculum content.
  • Extending the same construction method to other countries or grade levels would test how well the benchmark generalizes beyond one national system.
  • Using the full response distributions rather than single correct answers opens evaluation of how closely model mistakes match human ones.
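
A speculative sketch of that last point, assuming a model's answer distribution is estimated by repeated sampling and compared with the students' published option shares; the item, option labels, and numbers are all hypothetical.

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two answer-option distributions."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)

# Students' published option shares for one MC item (hypothetical):
student_shares = {"ア": 0.12, "イ": 0.61, "ウ": 0.19, "エ": 0.08}
# Model's option frequencies over repeated sampled generations (hypothetical):
model_shares   = {"ア": 0.05, "イ": 0.80, "ウ": 0.15, "エ": 0.00}

print(f"TV distance = {total_variation(model_shares, student_shares):.2f}")  # 0.19
# 0.0 would mean the model's error profile mirrors the student population;
# 1.0 would mean the two distributions share no mass at all.
```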

Load-bearing premise

The officially released exam items and their aggregated nationwide student response distributions form a valid and unbiased test bed that reflects the actual visual and linguistic reasoning demands placed on students.

What would settle it

If model accuracy patterns across question types fail to align with the observed student response distributions, especially on items that differ mainly in their visual or textual demands, the benchmark's claim to be a human-grounded test would not hold.
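
One way such a settling test could be run: rank-correlate per-item model accuracy with the students' empirical correct-answer rates, splitting by visual versus text-only items. This is an editorial sketch, not the paper's protocol; ties are broken arbitrarily, and the numbers are invented.

```python
def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation (ties broken arbitrarily here;
    a real analysis would average tied ranks)."""
    def ranks(v: list[float]) -> list[float]:
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Per-item model correctness vs. student correct-answer rates (invented):
model_acc     = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
student_rates = [0.92, 0.31, 0.77, 0.85, 0.40, 0.66]
print(f"rho = {spearman(model_acc, student_rates):.2f}")
```

A strongly positive rho on both item groups would support the human-grounded claim; a correlation that collapses on visual items would point at the failure mode described above.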

Figures

Figures reproduced from arXiv: 2605.11663 by Kyosuke Takami, Satoshi Sekine, Yuka Tateisi, Yusuke Miyao.

Figure 1. An example of the original exam page.
Figure 2. Pipeline for constructing the benchmark from official exam PDFs and assessment reports.
Figure 3. Markdown file for a multimodal question.
Figure 4. Accuracy heatmaps for multiple-choice (MC) items across three subjects. Each row represents a model, and each column corresponds to an item ID. Color intensity encodes model accuracy (0 or 1) for LLMs and empirical correct-answer rates for students. Lighter cells indicate correctly answered items, while darker ones indicate incorrect predictions. Student answer rates (bottom row) are annotated for comparison.
original abstract

Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N ≈ 900,000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: https://github.com/KyosukeTakami/gakucho-benchmark

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a multimodal dataset constructed from officially released middle-school items in Science, Mathematics, and Japanese Language from Japan's National Assessment of Academic Ability, paired with nationwide aggregated student response distributions (N ≈ 900,000). It benchmarks recent MLLMs using exact-match accuracy and character-level F1 on open-ended responses, reports variation across subjects and sensitivity to visual demands, and includes human evaluation plus LLM-as-judge analyses to assess automatic scoring reliability. The work positions the dataset as a reproducible, human-grounded benchmark for multimodal educational reasoning.

Significance. If the selected items and aggregated distributions accurately reflect authentic visual and linguistic reasoning loads without distortion, the release would supply a high-validity, large-scale resource that fills a gap in Japanese K-12 grounded benchmarks and enables direct human-model comparisons, reproducible evaluation, and downstream work on feedback generation and explainable AI in assessment contexts. The public GitHub release and use of real exam layouts are concrete strengths.

major comments (2)
  1. [Abstract and §2] Abstract and §2 (dataset construction): the central claim that the dataset forms a 'human-grounded' and 'unbiased' test bed for multimodal reasoning rests on the assumption that officially released items plus aggregated nationwide distributions accurately capture student reasoning demands. The manuscript provides no explicit criteria for item selection, no description of the aggregation methodology (e.g., how partial-credit or common-error patterns are handled), and no analysis of potential selection or digitization biases. This directly undermines the validity of human-model performance comparisons.
  2. [§4] §4 (benchmarking and evaluation): the reported 'strong sensitivity to visual reasoning demands' and automatic-scoring reliability are load-bearing for the benchmark's utility, yet the paper does not detail prompt framing, image resolution/resizing choices, or text-extraction procedures. Without these, it is impossible to determine whether observed model-human gaps reflect genuine reasoning differences or artifacts introduced by converting paper layouts to model inputs.
minor comments (1)
  1. [Abstract] The abstract contains minor LaTeX formatting artifacts (e.g., '900{,}000') that should be cleaned for readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify how to strengthen the presentation of our dataset's construction and evaluation procedures. We address each major comment point by point below, indicating planned revisions to the manuscript.

point-by-point responses
  1. Referee: [Abstract and §2] Abstract and §2 (dataset construction): the central claim that the dataset forms a 'human-grounded' and 'unbiased' test bed for multimodal reasoning rests on the assumption that officially released items plus aggregated nationwide distributions accurately capture student reasoning demands. The manuscript provides no explicit criteria for item selection, no description of the aggregation methodology (e.g., how partial-credit or common-error patterns are handled), and no analysis of potential selection or digitization biases. This directly undermines the validity of human-model performance comparisons.

    Authors: We agree that greater transparency on these points is needed to fully substantiate the human-grounded and unbiased framing. The items consist of all officially released middle-school questions in the three subjects from the public National Assessment releases, and the response distributions are the official nationwide aggregates released by the assessment authority. In the revised manuscript, we will expand §2 with: explicit selection criteria (all available items without further filtering); a description of the aggregation process as reported in the official data releases (counts per response category for both multiple-choice and open-ended items, with no partial-credit scoring applied by the source); and a brief discussion of digitization (manual recreation of diagrams and layouts from original PDFs to preserve visual structure, with no automated OCR for Japanese text). This addition will directly support the validity of human-model comparisons. revision: yes

  2. Referee: [§4] §4 (benchmarking and evaluation): the reported 'strong sensitivity to visual reasoning demands' and automatic-scoring reliability are load-bearing for the benchmark's utility, yet the paper does not detail prompt framing, image resolution/resizing choices, or text-extraction procedures. Without these, it is impossible to determine whether observed model-human gaps reflect genuine reasoning differences or artifacts introduced by converting paper layouts to model inputs.

    Authors: We concur that these methodological details are required for reproducibility and to rule out input artifacts. In the revised §4, we will add a dedicated subsection specifying: the exact prompt templates and framing used for each evaluated MLLM (including any system instructions and output format requirements, to be included in full in an appendix); image handling (original exam page scans resized to a maximum of 512 pixels on the longer side while preserving aspect ratio and without cropping or additional augmentation); and text extraction (manual transcription of all Japanese text by native speakers, verified against the source PDFs to ensure fidelity). These clarifications will confirm that the reported sensitivities and scoring reliability reflect model performance on authentic inputs. revision: yes
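
For concreteness, a sketch of the image handling the simulated rebuttal describes: downscale to at most 512 pixels on the longer side, preserve aspect ratio, no cropping. Pillow and the function name are assumptions, and since the rebuttal itself is simulated, the parameters should be read as illustrative rather than confirmed.

```python
from PIL import Image

def prepare_page(path: str, max_side: int = 512) -> Image.Image:
    """Load an exam-page scan and downscale so the longer side is
    at most max_side pixels, preserving aspect ratio; no cropping."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = max_side / max(w, h)
    if scale < 1.0:  # never upscale pages that are already small
        img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    return img
```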

Circularity Check

0 steps flagged

No significant circularity: data release with direct empirical evaluation

full rationale

The paper is a dataset release accompanied by empirical benchmarking of MLLMs on officially released exam items and aggregated human response distributions. No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on the external validity of the source data and direct accuracy/F1 comparisons rather than any self-referential loop, self-citation chain, or renamed known result. This is the standard non-circular outcome for a benchmark data paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset release paper. No mathematical derivations, free parameters, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5515 in / 1039 out tokens · 52805 ms · 2026-05-13T01:08:04.791967+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
