
arxiv: 2604.06898 · v1 · submitted 2026-04-08 · 💻 cs.CY

Are LLMs Ready for Computer Science Education? A Cross-Domain, Cross-Lingual and Cognitive-Level Evaluation Using Professional Certification Exams

Pith reviewed 2026-05-10 17:51 UTC · model grok-4.3

classification 💻 cs.CY
keywords: LLMs · computer science education · certification exams · cognitive levels · cross-lingual evaluation · model benchmarking · language performance

The pith

LLMs show partial readiness for computer science education with strengths in basic tasks but limitations in complex reasoning and language consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates four large language models on 1,068 questions from six professional certification exams covering networking, office applications, and Java programming. Performance is measured across English and Chinese, different cognitive levels, and robustness checks. Findings reveal model-specific advantages by language and a general drop in accuracy as questions demand more advanced thinking. The results support using LLMs in CS education for suitable tasks while identifying areas needing improvement.

Core claim

Benchmarking GPT-5, DeepSeek-R1, Qwen-Plus, and Llama-3.3-70B-Instruct on certification exams shows GPT-5 leading in English, Qwen-Plus in Chinese, DeepSeek-R1 with the best balance across languages, and Llama-3.3 with notable weaknesses in higher-order reasoning and robustness to input changes. All models scored lower on tasks requiring more complex cognitive skills.

What carries the argument

Systematic testing of LLMs on real certification exam questions grouped by cognitive complexity and language.

If this is right

  • LLMs can support basic-level computer science instruction and content generation.
  • Model choice for educational applications should consider the primary language of instruction.
  • Advanced courses emphasizing analysis and evaluation require human oversight when using LLMs.
  • Robustness issues in some models suggest careful testing before deployment in assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Curriculum developers might use these benchmarks to decide which model to integrate for specific course levels.
  • Similar testing frameworks could apply to evaluating LLMs in other educational domains like mathematics or engineering.
  • Addressing performance gaps in non-English languages could broaden the utility of these tools globally.

Load-bearing premise

That performance on certification exam questions reflects how well an LLM can aid computer science learning.

What would settle it

A study tracking actual student exam scores when LLMs are used as study aids compared to traditional study methods.

Figures

Figures reproduced from arXiv: 2604.06898 by Chen Gao, Chi Liu, Congcong Zhu, Dongfu Xiao, Huajie Chen, Maiying Sui, Sheng Shen, Tianqing Zhu, Wanlei Zhou, Xiaotong Han, Xuhan Zuo, Zhengquan Luo, Zongyuan Ge.

Figure 1. The Overall Research Framework.
Figure 2. The Prompt Construction Process.
  Adjacent text from Section 2.4 (Cross-lingual Evaluation): We constructed a parallel Chinese-English corpus using DeepSeek-V3 to automatically translate the original question set (DeepSeek-AI, 2024). This aimed to investigate how language affects the effectiveness of the models and to examine potential cross-linguistic biases. The dataset consisted of three examination banks in Chinese (NCRE MS Office m…
Figure 3. Radar chart of model accuracy rates (no masking) across six professional certi…
Figure 4. Comparison of model performance on original versus translated questions. (a) …
Figure 5. Model accuracy across domains and key subtopics with 95% confidence intervals, …
Figure 6. Accuracy on higher-order (blue) and lower-order (green) questions for each …
Figure 7. Confidence-accuracy relationship by confidence level. (a) Distribution of correct …
Figure 8. (a) Illustration of masking perturbation on a sample question stem. (b) Overall …
Original abstract

Large language models (LLMs) are increasingly applied in computer science education for tasks such as tutoring, content generation, and code assessment. However, systematic evaluations aligned with formal curricula and certification standards remain limited. This study benchmarked four recent models, including GPT-5, DeepSeek-R1, Qwen-Plus, and Llama-3.3-70B-Instruct, using a dataset of 1,068 questions derived from six certification exams covering networking, office applications, and Java programming. We evaluated performance across language (Chinese vs. English), cognitive levels based on Bloom's Taxonomy, domain knowledge, confidence-accuracy alignment, and robustness to input masking. Results showed that GPT-5 performed best on English-language certifications, while Qwen-Plus performed better in Chinese contexts. DeepSeek-R1 achieved the most balanced cross-lingual performance, whereas Llama-3.3 showed clear limitations in higher-order reasoning and robustness. All models performed worse on more complex tasks. These findings provide empirical support for the integration of LLMs into computer science education and offer practical implications for curriculum design and assessment.
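The confidence-accuracy alignment mentioned in the abstract can be illustrated with a small tabulation. The sketch below is not the authors' code: it assumes each model answer comes with a self-rated confidence on a 1 to 5 scale and a correctness flag, and the function name `accuracy_by_confidence` and the toy records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_confidence(records):
    """Map each confidence level to (number of answers, accuracy).

    `records` is an iterable of (confidence, is_correct) pairs.
    The 1-5 confidence scale is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, is_correct in records:
        totals[confidence] += 1
        hits[confidence] += int(is_correct)
    return {
        level: (totals[level], hits[level] / totals[level])
        for level in sorted(totals)
    }


# Hypothetical toy data: (self-rated confidence, answer was correct).
toy_records = [(5, True), (5, True), (5, False), (3, True), (3, False), (1, False)]
confidence_table = accuracy_by_confidence(toy_records)
```

A well-aligned model would show accuracy rising with the confidence level; a flat or inverted table would suggest that self-reported confidence is a poor guide to correctness.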

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates four LLMs (GPT-5, DeepSeek-R1, Qwen-Plus, Llama-3.3-70B-Instruct) on 1,068 questions drawn from six professional certification exams covering networking, office applications, and Java programming. It measures performance across English/Chinese languages, Bloom's Taxonomy cognitive levels, domains, confidence-accuracy alignment, and robustness to input masking. Reported results indicate GPT-5 leads on English certifications, Qwen-Plus on Chinese, DeepSeek-R1 shows balanced cross-lingual performance, Llama-3.3 is weakest on higher-order reasoning, and all models decline on more complex tasks. The authors conclude that the findings provide empirical support for integrating LLMs into computer science education.

Significance. If the certification questions prove representative of CS education goals and the Bloom's annotations are reliable, the cross-lingual and cognitive-level benchmark supplies useful empirical data on current LLM capabilities in applied domains, with potential implications for curriculum design and assessment. The multi-dimensional protocol is a positive contribution to LLM evaluation in education.

major comments (3)
  1. [Abstract and Methods] The description of the 1,068-question dataset provides no details on sourcing criteria, selection process, or any mapping to standard CS curricula. This is load-bearing for the central claim that results support LLM integration into CS education, because professional certifications often prioritize procedural knowledge over the theoretical abstraction and design skills emphasized in university courses.
  2. [Results (Bloom's Taxonomy analysis)] No inter-annotator agreement, expert validation, or reliability metrics are reported for the assignment of Bloom's Taxonomy levels to the questions. Without this, the key finding that models perform worse on higher-order tasks cannot be confidently distinguished from possible annotation artifacts.
  3. [Results and Discussion] Performance comparisons (model rankings, language/domain differences, and the decline on complex tasks) are presented without statistical tests, confidence intervals, or error analysis. This weakens the ability to assess whether observed differences are robust or meaningful.
minor comments (2)
  1. [Abstract] The six certification exams are not named or cited, which reduces reproducibility and context for readers.
  2. [Methods] Consider adding a summary table breaking down the 1,068 questions by domain, language, and Bloom's level to improve clarity of the experimental setup.
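
Major comment 2 asks for inter-annotator agreement on the Bloom's level labels. One common statistic for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch under that assumption; the function name `cohens_kappa`, the two-label scheme, and the toy annotations are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: "L" = lower-order, "H" = higher-order.
annotator_1 = ["L", "L", "H", "H", "L", "H"]
annotator_2 = ["L", "L", "H", "L", "L", "H"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on five of six items, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate.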

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified several areas where the manuscript can be strengthened. We address each major comment point by point below and commit to revisions that improve transparency, reliability, and statistical rigor without altering the core findings.

Point-by-point responses
  1. Referee: [Abstract and Methods] The description of the 1,068-question dataset provides no details on sourcing criteria, selection process, or any mapping to standard CS curricula. This is load-bearing for the central claim that results support LLM integration into CS education, because professional certifications often prioritize procedural knowledge over the theoretical abstraction and design skills emphasized in university courses.

    Authors: We appreciate this observation and agree that greater detail on dataset construction is warranted. The 1,068 questions were drawn from publicly available practice exams and official preparation materials for six professional certifications (CompTIA Network+, Cisco CCNA, Microsoft Office Specialist, Oracle Java Programmer, and two additional domain-specific exams). Selection followed a stratified sampling approach to ensure balance across domains, languages, and difficulty. While these certifications emphasize applied skills, they map to core CS topics such as network protocols, software development, and productivity tools that appear in standard ACM/IEEE curricula. In the revised manuscript, we will expand the Methods section with explicit sourcing criteria, the full selection protocol, and a dedicated discussion of alignment (including both the practical relevance and the acknowledged limitations relative to theoretical design skills). This will better support the claim regarding LLM integration into CS education.
    revision: yes

  2. Referee: [Results (Bloom's Taxonomy analysis)] No inter-annotator agreement, expert validation, or reliability metrics are reported for the assignment of Bloom's Taxonomy levels to the questions. Without this, the key finding that models perform worse on higher-order tasks cannot be confidently distinguished from possible annotation artifacts.

    Authors: We agree that reliability metrics are necessary to substantiate the Bloom's Taxonomy findings. The levels were assigned using the standard six-category framework, with two authors performing independent annotations followed by discussion to resolve disagreements. In the revised version, we will add a description of the annotation protocol in the Methods section and report inter-annotator agreement via Cohen's kappa. We will also note any limitations and, where possible, include a brief expert validation step to further mitigate concerns about annotation artifacts.
    revision: yes

  3. Referee: [Results and Discussion] Performance comparisons (model rankings, language/domain differences, and the decline on complex tasks) are presented without statistical tests, confidence intervals, or error analysis. This weakens the ability to assess whether observed differences are robust or meaningful.

    Authors: We concur that the lack of statistical support weakens the interpretability of the reported differences. The revised Results and Discussion sections will include 95% bootstrap confidence intervals for all accuracy figures, appropriate statistical tests (McNemar's test for paired model comparisons and chi-square tests for differences across languages, domains, and cognitive levels), and a dedicated error analysis subsection addressing potential sources of variance such as question phrasing and model-specific failure modes. These additions will allow readers to evaluate the robustness of the observed patterns.
    revision: yes
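
Two of the statistics named in the third response can be sketched with the Python standard library alone. The code below is an illustration under stated assumptions, not the authors' analysis: it uses a percentile bootstrap for the accuracy interval and the exact (binomial) variant of McNemar's test, and the function names, resample count, and toy correctness vectors are hypothetical.

```python
import random
from math import comb


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy over 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and recompute accuracy each time.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower_index = int(round((alpha / 2) * n_resamples))
    upper_index = int(round((1 - alpha / 2) * n_resamples)) - 1
    return means[lower_index], means[upper_index]


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two models on the same questions."""
    # Only discordant pairs carry information about a difference.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 1.0
    smaller = min(only_a, only_b)
    tail = sum(comb(discordant, i) for i in range(smaller + 1)) / 2 ** discordant
    return min(1.0, 2 * tail)


# Hypothetical per-question correctness for two models on eight questions.
model_a = [1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 0, 0, 0, 0, 1, 0]
ci_low, ci_high = bootstrap_accuracy_ci(model_a)
p_value = mcnemar_exact(model_a, model_b)
```

The paired test is the natural choice here because every model answers the same question set; comparing two accuracy percentages as if they came from independent samples would discard that pairing.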

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark on external certification data


The paper reports direct performance measurements of four LLMs on a fixed, externally sourced dataset of 1,068 questions drawn from six professional certification exams. All reported results (accuracy by language, by Bloom's cognitive level, by domain, confidence calibration, and robustness to masking) are computed from model outputs against ground-truth answers without any fitted parameters, derived predictions, or self-referential equations. Bloom's Taxonomy is applied as an external classification scheme; no derivation or uniqueness claim reduces to the authors' own prior work. No load-bearing self-citations or ansatzes are present in the central claims. The evaluation is therefore self-contained against external benchmarks.
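
The kind of direct measurement described above can be sketched as a grouped accuracy computation. This is an assumption-laden illustration rather than the paper's scoring code: it assumes exact-match scoring on the set of option letters, and the field names `predicted`, `gold`, `language`, and `level`, along with the toy rows, are hypothetical.

```python
from collections import defaultdict


def normalize(answer):
    """Treat an answer as an unordered set of uppercase option letters."""
    return frozenset(ch for ch in answer.upper() if ch.isalpha())


def grouped_accuracy(rows, key):
    """Exact-match accuracy per value of `key` (e.g. language or level)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        hits[group] += int(normalize(row["predicted"]) == normalize(row["gold"]))
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical rows: one model's answers against ground-truth keys.
rows = [
    {"language": "en", "level": "lower", "predicted": "A", "gold": "A"},
    {"language": "en", "level": "higher", "predicted": "b, c", "gold": "CB"},
    {"language": "zh", "level": "higher", "predicted": "A", "gold": "AD"},
    {"language": "zh", "level": "lower", "predicted": "D", "gold": "D"},
]
by_language = grouped_accuracy(rows, "language")
by_level = grouped_accuracy(rows, "level")
```

Because nothing is fitted, the same function can be reused for any grouping column, which is consistent with the audit's point that the results are plain tallies against external answer keys.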

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumptions that certification exams validly represent CS education and that Bloom's Taxonomy provides a reliable cognitive-level lens; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Bloom's Taxonomy provides a valid and applicable categorization of cognitive levels for CS certification exam questions.
    Invoked to stratify performance results across cognitive complexity.


