pith. machine review for the scientific record.

arxiv: 2604.13058 · v2 · submitted 2026-03-18 · 💻 cs.CL · cs.LG · cs.MM

Recognition: no theorem link

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:32 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · cs.MM
keywords Korean multimodal benchmark · vision-language models · cultural evaluation · exam questions · model performance gaps · local conventions · multimodal understanding

The pith

A new benchmark of 3,466 native Korean exam questions holds the best open-source multimodal model to 42.05 percent accuracy on the full set and the best proprietary model to 52.42 percent on its hard subset, with the largest gaps on items requiring local conventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KMMMU as a collection of exam questions written originally in Korean across nine disciplines and nine visual formats. Unlike benchmarks built from English sources or translations, these problems embed Korean institutional standards, cultural references, and discipline-specific diagrams. Experiments find the best open-source model at 42.05 percent overall, while the strongest proprietary system reaches only 52.42 percent on the hardest 627 questions. Korean-specific items produce accuracy drops as large as 13.43 percent. Error patterns point to trouble with mapping local conventions to answers, recalling region-specific facts, and applying domain rules, rather than to failures of general reasoning.

Core claim

KMMMU contains 3,466 questions drawn from Korean exams that cover nine disciplines and nine visual modality categories, plus a 300-item Korean-specific subset and a 627-item hard subset. The strongest open-source model achieves 42.05 percent accuracy on the full set, while the best proprietary model reaches only 52.42 percent on the hard subset. Accuracy varies sharply by discipline and falls further, by up to 13.43 percent, on Korean-specific questions. Failures arise mainly from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding rather than from insufficient reasoning depth.
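As a rough illustration of how these headline numbers reduce to simple counts, the sketch below computes overall, hard-subset, and Korean-specific accuracy plus the gap. The item fields (`subsets`, `correct`) and the exact gap definition (Korean-specific items versus the remainder) are assumptions made for illustration, not the paper's evaluation harness.

```python
# Minimal sketch, assuming each scored item is a dict with a boolean "correct"
# flag and a "subsets" tag set; these field names are placeholders.
def accuracy(items):
    """Fraction of items answered correctly."""
    return sum(it["correct"] for it in items) / len(items) if items else float("nan")

def summarize(items):
    korean = [it for it in items if "korean_specific" in it["subsets"]]
    hard = [it for it in items if "hard" in it["subsets"]]
    rest = [it for it in items if "korean_specific" not in it["subsets"]]
    return {
        "overall_acc": accuracy(items),    # reported: 0.4205 for the best open-source model
        "hard_acc": accuracy(hard),        # reported: 0.5242 for the best proprietary model
        "korean_specific_acc": accuracy(korean),
        "korean_specific_gap": accuracy(rest) - accuracy(korean),  # reported gaps up to 0.1343
    }
```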

What carries the argument

The KMMMU benchmark itself, built from native Korean exam questions that embed local cultural conventions and visual formats.

Load-bearing premise

The selected Korean exam questions and visual formats represent the distribution of real-world Korean multimodal tasks without major selection bias toward certain topics or difficulty levels.

What would settle it

An independently collected set of Korean exam or workplace questions on which the same models achieved markedly higher accuracy than the 42.05 percent reported here would indicate that the benchmark overstates the difficulty of typical Korean multimodal tasks.
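To make "markedly higher" concrete, one option is a one-sided two-proportion z-test between the KMMMU score and the score on the independent collection. This is a sketch under the assumption that items are scored as independent Bernoulli trials; the counts for the independent set are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def accuracy_gap_test(correct_bench, n_bench, correct_indep, n_indep):
    """One-sided two-proportion z-test: is accuracy on the independent
    collection higher than on the benchmark? Returns (gap, z, p_value)."""
    p_bench = correct_bench / n_bench
    p_indep = correct_indep / n_indep
    p_pool = (correct_bench + correct_indep) / (n_bench + n_indep)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_bench + 1 / n_indep))
    z = (p_indep - p_bench) / se
    return p_indep - p_bench, z, 1 - NormalDist().cdf(z)

# Hypothetical comparison: 42.05% of 3,466 KMMMU items vs. an invented 55% of
# 1,000 independently collected items.
gap, z, p_value = accuracy_gap_test(round(0.4205 * 3466), 3466, 550, 1000)
```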

Figures

Figures reproduced from arXiv: 2604.13058 by Chanyoung Kim, Guijin Son, Hyunwoo Ko, Il-Youp Kwak, JunYoung An, Kyubeen Han, Nahyun Lee.

Figure 1: Comparison of English (MMMU, MMMU-Pro), Japanese (JMMMU, JMMMU-Pro), and Korean (others) multimodal benchmarks. Each point is positioned by benchmark size (x-axis, log scale) and difficulty proxy (100 − peak public score), with lighter colors indicating more recent releases. Shaded regions mark two common limitations: small size (left) and low headroom (bottom).
Figure 2: Examples of KMMMU questions. Examples include the original questions, associated images, English translations, and metadata such as visual modality, question format, and Korean-specific labels.
Figure 3: Discipline-wise visual modality composition of KMMMU. Stacked bars show the number of questions for each visual modality in each discipline, with total counts shown beneath the labels. Scatter points indicate Korean-specific items overlaid on the corresponding discipline–modality segments, jittered randomly.
Figure 4: Distributional integrity after adversarial filtering. Question text embeddings from the original 68k corpus, the KMMMU Full set, and the Hard subset are projected using PCA followed by 3D UMAP. Both filtered subsets largely preserve the global structure of the original distribution.
Figure 5: Reasoning gain by discipline and question type in Qwen3-VL-32B (IT vs. Thinking). Numbers in parentheses indicate the total number of questions in each category.
Figure 6: Example of a Korean-specific regulatory category mismatch. Qwen3-VL-235B-A22B-IT reads the table correctly, but maps small vehicle to the wrong category and applies the wrong standard. This is a failure of institutional knowledge recall and lexical category matching, not OCR.
Figure 7: Annotation tool interface used for OCR verification. The tool displays the original PDF page on the left and the parsed text and images on the right, allowing annotators to correct OCR errors and validate image cropping in real time.
Figure 8: Data card for a Korean-specific question. The figure aggregates the raw inputs and their translations: [Original Image] the original visual input containing a text-rich regulation box; [Original Question] the original question text in Korean; [Translation] English translations for both the visual context and the question. Correctly answering this question requires retrieving specific legal provisions.
Figure 9: Representative examples of fine-grained visual types in KMMMU. Before consolidation into the final macro-level visual modality categories, the dataset included diverse fine-grained visual types, such as specialized engineering diagrams, document-style text images, and South Korean geographic maps.
Figure 10: Per-dimension density comparison after adversarial filtering. Kernel density estimates over the three UMAP dimensions for the original 68k corpus, the KMMMU Full set, and the Hard subset. The filtered subsets broadly retain the major density peaks and multimodal trends of the original distribution, although the Hard subset shows a somewhat larger deviation in Dimension 3.
Figure 11: Discipline-wise visual modality composition of the KMMMU Hard set. Stacked bars show the number of questions for each visual modality in each discipline, with total counts shown beneath the labels. Scatter points indicate Korean-specific items overlaid on the corresponding discipline–modality segments. The hard subset is concentrated in Engineering and Natural Sciences, similar to the Full set.
Figure 12: Annotation interface for manual validation of LLM-Judge outputs. For each sample, annotators review the question, image, gold answer, model response, and parsed answer, and record parsing consistency, correctness judgments, metadata consistency, and optional comments.
Figure 13: Accuracy by visual modality for Qwen3-VL-32B-IT and Qwen3-VL-32B-Thinking. Performance remains broadly similar across visual modality categories, suggesting that explicit reasoning does not systematically change raw visual evidence extraction. The main differences appear to arise after evidence extraction, such as in task framing, constraint tracking, and answer finalization.
Figure 14: Rigid conceptual framing in a Natural Sciences reversal. Qwen3-VL-32B-IT correctly applies the relevant conductivity criterion, whereas Qwen3-VL-32B-Thinking overcommits to an overly rigid band-gap-based schema and rejects the crucial statement about the Fermi level in the conduction band.
Figure 15: Structural misinterpretation in an Engineering reversal (fault tree analysis). Qwen3-VL-32B-IT partially corrects an early gate-level misinterpretation, whereas Qwen3-VL-32B-Thinking persists with an incorrect top-level gate reading and derives the wrong recovery point.
Figure 16: Exact architectural category misclassification in Arts & Design. The model recognizes the overall structure of the plan, but fails to map it to the correct standardized architectural category. Instead, it overcommits to an orthogonal-plan interpretation and supports it with a plausible but incorrect villa association.
Figure 17: Few-shot symbolic induction failure in General. A Shanghainese-language item requiring the model to infer a latent mapping from a small set of diagram–expression pairs and apply it to new cases. Although Qwen3-VL-235B-A22B-IT produces a detailed step-by-step analysis, it fails to recover the full correspondence system and instead relies on partial surface analogies, leading to a plausible but incorrect answer.
Figure 18: Exact standards and rule-criterion misapplication in General. An item testing the official Romanization rule for hyphen use. The model gives a broadly reasonable explanation of the rule, but selects the wrong option because it applies an approximate plausibility-based criterion rather than the exact condition required by the formal standard.
original abstract

We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces KMMMU, a benchmark of 3,466 questions drawn from native Korean exams spanning nine disciplines and nine visual modality categories. It includes a 300-item Korean-specific subset and a 627-question hard subset. Model evaluations report that the strongest open-source model achieves 42.05% accuracy on the full set while the best proprietary model reaches 52.42% on the hard subset, with performance varying by discipline and showing drops of up to 13.43% on Korean-specific items. Error analysis attributes failures primarily to weak convention-to-label mapping, localized knowledge recall, and domain-specific standards rather than insufficient reasoning depth.

Significance. If the subset construction is free of selection bias, KMMMU supplies a concrete, non-English multimodal testbed that quantifies gaps in handling culturally embedded visual and institutional content. The reported accuracy ceilings and discipline-specific bottlenecks provide falsifiable targets for future model development and underscore the limitations of English-centric training for expert-level tasks in other languages.

major comments (1)
  1. [Abstract and benchmark construction] The headline claim that Korean-specific questions produce gaps of up to 13.43% (and that these reflect localized knowledge rather than difficulty) rests on the unverified assumption that the 300-item subset and 627 hard questions are representative samples. No sampling protocol, inter-annotator agreement statistics for the 'Korean-specific' label, or comparative difficulty/visual-density distributions between the full set and the subsets are reported. If harder or more visually complex items were over-selected for the Korean-specific subset, the observed gaps could be artifacts of curation rather than evidence of model deficiencies in convention-to-label mapping.
minor comments (2)
  1. [Experiments] Full details on prompting templates, few-shot examples, and any post-processing of model outputs should be provided to confirm that the accuracy numbers (42.05%, 52.42%) are not sensitive to undocumented choices.
  2. [Error analysis] The categorization of errors into 'weak convention-to-label mapping' versus 'insufficient reasoning' would benefit from quantitative inter-annotator agreement on the error labels and a breakdown by visual modality (a minimal agreement computation is sketched after this list).
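For reference, the agreement statistic asked for in the second minor comment is easy to compute once two annotators have labeled the same error cases. A minimal Cohen's kappa sketch; the error-category names are placeholders, not the paper's taxonomy.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)  # undefined if expected == 1

# Hypothetical labels for six shared error cases.
ann_a = ["convention_mapping", "knowledge_recall", "reasoning",
         "convention_mapping", "standards", "reasoning"]
ann_b = ["convention_mapping", "knowledge_recall", "standards",
         "convention_mapping", "standards", "reasoning"]
kappa = cohens_kappa(ann_a, ann_b)
```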

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address the concern about subset representativeness and construction details below, and commit to adding the requested clarifications and statistics in the revised manuscript.

point-by-point responses
  1. Referee: [Abstract and benchmark construction] The headline claim that Korean-specific questions produce gaps of up to 13.43% (and that these reflect localized knowledge rather than difficulty) rests on the unverified assumption that the 300-item subset and 627 hard questions are representative samples. No sampling protocol, inter-annotator agreement statistics for the 'Korean-specific' label, or comparative difficulty/visual-density distributions between the full set and the subsets are reported. If harder or more visually complex items were over-selected for the Korean-specific subset, the observed gaps could be artifacts of curation rather than evidence of model deficiencies in convention-to-label mapping.

    Authors: We appreciate the referee highlighting the importance of transparency in subset construction. The 300-item Korean-specific subset was selected by native Korean domain experts who reviewed questions for explicit reliance on localized conventions, regulations, or cultural/institutional knowledge unique to Korea (e.g., specific Korean legal standards or historical references not covered in international equivalents). The 627-question hard subset was constructed by first running preliminary model evaluations and then supplementing with expert judgment on items requiring dense visual interpretation or multi-step domain reasoning. We acknowledge that the original manuscript did not include a full sampling protocol description, inter-annotator agreement figures, or comparative difficulty/visual-density statistics. In the revision we will add a dedicated subsection detailing the exact selection criteria and process, report inter-annotator agreement for the Korean-specific labeling (performed by multiple experts), and include side-by-side comparisons of metrics such as average token length, number of answer choices, proportion of image-containing questions, and image modality distribution across the full set, Korean-specific subset, and hard subset. These additions will allow readers to assess whether the observed 13.43% gaps are attributable to curation bias or to the intended factors of localized knowledge and convention-to-label mapping (a minimal version of this comparison is sketched below). Revision planned: yes.
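The promised side-by-side comparison reduces to a small grouped summary once per-item metadata exists. A sketch assuming a pandas DataFrame with hypothetical columns `subset`, `question_tokens`, `num_choices`, and `has_image`; the schema is illustrative, not the authors'.

```python
import pandas as pd

def subset_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-subset curation statistics of the kind the rebuttal commits to reporting.
    Column names are placeholders for the benchmark's actual metadata."""
    return df.groupby("subset").agg(
        n_items=("question_tokens", "size"),
        avg_question_tokens=("question_tokens", "mean"),
        avg_num_choices=("num_choices", "mean"),
        frac_with_image=("has_image", "mean"),
    )

# Each row is one question tagged with the subset it belongs to ("full",
# "korean_specific", or "hard"); items appearing in several subsets are
# duplicated so each subset is profiled on its own members.
```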

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark evaluation

full rationale

The paper introduces KMMMU as a fixed benchmark of 3,466 exam questions and reports direct accuracy measurements on it (e.g., 42.05% for the strongest open-source model). No equations, fitted parameters, predictions, or derivations appear anywhere in the manuscript. Performance gaps and error analysis are computed from model outputs on the held-out test items; nothing reduces to prior quantities by construction. Self-citations, if present, are not load-bearing for any result. The evaluation is self-contained, resting entirely on runs of external models against the benchmark items.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. No mathematical derivations, fitted parameters, or new postulated entities are introduced; the central claims rest on the construction and evaluation of the dataset itself.

pith-pipeline@v0.9.0 · 5518 in / 1109 out tokens · 37369 ms · 2026-05-15T10:32:39.381860+00:00 · methodology


    Parallel - tw o parallel wings. 3 . Or t hogonal ( or cr oss) - wit h wings at right angles. The Villa Medici in Fiesole is consider ed an e xample of t he or t hogonal plan. So if t he image sho ws a cr oss shape (v er tical and horiz ontal ax es), t hen it's 직교형 and t he villa is 메디치장. So option 3 . </t hink> corr ect option: $\ \bo x ed{(3) 직교형-메디치장}$ ...