Are LLMs Ready for Computer Science Education? A Cross-Domain, Cross-Lingual and Cognitive-Level Evaluation Using Professional Certification Exams
Pith reviewed 2026-05-10 17:51 UTC · model grok-4.3
The pith
LLMs show partial readiness for computer science education with strengths in basic tasks but limitations in complex reasoning and language consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Benchmarking GPT-5, DeepSeek-R1, Qwen-Plus, and Llama-3.3-70B-Instruct on certification exams shows GPT-5 leading in English, Qwen-Plus in Chinese, DeepSeek-R1 with the best balance across languages, and Llama-3.3 with notable weaknesses in higher-order reasoning and robustness to input changes. All models scored lower on tasks requiring more complex cognitive skills.
What carries the argument
Systematic testing of LLMs on real certification exam questions grouped by cognitive complexity and language.
If this is right
- LLMs can support basic-level computer science instruction and content generation.
- Model choice for educational applications should consider the primary language of instruction.
- Advanced courses emphasizing analysis and evaluation require human oversight when using LLMs.
- Robustness issues in some models suggest careful testing before deployment in assessments.
Where Pith is reading between the lines
- Curriculum developers might use these benchmarks to decide which model to integrate for specific course levels.
- Similar testing frameworks could apply to evaluating LLMs in other educational domains like mathematics or engineering.
- Addressing performance gaps in non-English languages could broaden the utility of these tools globally.
Load-bearing premise
That performance on certification exam questions reflects how well an LLM can aid computer science learning.
What would settle it
A study tracking actual student exam scores when LLMs are used as study aids compared to traditional study methods.
Original abstract
Large language models (LLMs) are increasingly applied in computer science education for tasks such as tutoring, content generation, and code assessment. However, systematic evaluations aligned with formal curricula and certification standards remain limited. This study benchmarked four recent models, including GPT-5, DeepSeek-R1, Qwen-Plus, and Llama-3.3-70B-Instruct, using a dataset of 1,068 questions derived from six certification exams covering networking, office applications, and Java programming. We evaluated performance across language (Chinese vs. English), cognitive levels based on Bloom's Taxonomy, domain knowledge, confidence-accuracy alignment, and robustness to input masking. Results showed that GPT-5 performed best on English-language certifications, while Qwen-Plus performed better in Chinese contexts. DeepSeek-R1 achieved the most balanced cross-lingual performance, whereas Llama-3.3 showed clear limitations in higher-order reasoning and robustness. All models performed worse on more complex tasks. These findings provide empirical support for the integration of LLMs into computer science education and offer practical implications for curriculum design and assessment.
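One plausible reading of "confidence-accuracy alignment" is accuracy tabulated per self-reported confidence level. The abstract does not name the metric, so the sketch below is an assumption rather than the authors' method; the function name and the toy records are hypothetical.

```python
from collections import defaultdict

def accuracy_by_confidence(records):
    """Map each Likert confidence level to (accuracy, number of answers)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, is_correct in records:
        totals[confidence] += 1
        hits[confidence] += int(is_correct)
    return {level: (hits[level] / totals[level], totals[level])
            for level in sorted(totals)}

# Hypothetical toy data: (self-reported confidence 1-5, answered correctly?)
toy_records = [(5, True), (5, True), (5, False), (3, True), (3, False), (1, False)]
calibration = accuracy_by_confidence(toy_records)
for level, (accuracy, count) in calibration.items():
    print(f"confidence={level} accuracy={accuracy:.2f} n={count}")
```

A well-aligned model would show accuracy rising with the confidence level; a flat or inverted profile would indicate miscalibration.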
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates four LLMs (GPT-5, DeepSeek-R1, Qwen-Plus, Llama-3.3-70B-Instruct) on 1,068 questions drawn from six professional certification exams covering networking, office applications, and Java programming. It measures performance across English/Chinese languages, Bloom's Taxonomy cognitive levels, domains, confidence-accuracy alignment, and robustness to input masking. Reported results indicate GPT-5 leads on English certifications, Qwen-Plus on Chinese, DeepSeek-R1 shows balanced cross-lingual performance, Llama-3.3 is weakest on higher-order reasoning, and all models decline on more complex tasks. The authors conclude that the findings provide empirical support for integrating LLMs into computer science education.
Significance. If the certification questions prove representative of CS education goals and the Bloom's annotations are reliable, the cross-lingual and cognitive-level benchmark supplies useful empirical data on current LLM capabilities in applied domains, with potential implications for curriculum design and assessment. The multi-dimensional protocol is a positive contribution to LLM evaluation in education.
major comments (3)
- [Abstract and Methods] The description of the 1,068-question dataset provides no details on sourcing criteria, selection process, or any mapping to standard CS curricula. This is load-bearing for the central claim that results support LLM integration into CS education, because professional certifications often prioritize procedural knowledge over the theoretical abstraction and design skills emphasized in university courses.
- [Results (Bloom's Taxonomy analysis)] No inter-annotator agreement, expert validation, or reliability metrics are reported for the assignment of Bloom's Taxonomy levels to the questions. Without this, the key finding that models perform worse on higher-order tasks cannot be confidently distinguished from possible annotation artifacts.
- [Results and Discussion] Performance comparisons (model rankings, language/domain differences, and the decline on complex tasks) are presented without statistical tests, confidence intervals, or error analysis. This weakens the ability to assess whether observed differences are robust or meaningful.
minor comments (2)
- [Abstract] The six certification exams are not named or cited, which reduces reproducibility and context for readers.
- [Methods] Consider adding a summary table breaking down the 1,068 questions by domain, language, and Bloom's level to improve clarity of the experimental setup.
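The breakdown table suggested in the last minor comment could be produced with a simple cross-tabulation. This is a minimal sketch under assumed field names (`domain`, `language`, `level`); the toy rows are hypothetical and do not reflect the paper's actual counts.

```python
from collections import Counter

def breakdown(questions):
    """Count questions per (domain, language, cognitive level) cell."""
    return Counter((q["domain"], q["language"], q["level"]) for q in questions)

# Hypothetical toy rows; the real dataset has 1,068 questions.
toy_questions = [
    {"domain": "networking", "language": "en", "level": "lower"},
    {"domain": "networking", "language": "en", "level": "higher"},
    {"domain": "networking", "language": "en", "level": "lower"},
    {"domain": "java", "language": "zh", "level": "higher"},
]
counts = breakdown(toy_questions)
for (domain, language, level), n in sorted(counts.items()):
    print(f"{domain:<12}{language:<4}{level:<8}{n}")
```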
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has identified several areas where the manuscript can be strengthened. We address each major comment point by point below and commit to revisions that improve transparency, reliability, and statistical rigor without altering the core findings.
- Referee: [Abstract and Methods] The description of the 1,068-question dataset provides no details on sourcing criteria, selection process, or any mapping to standard CS curricula. This is load-bearing for the central claim that results support LLM integration into CS education, because professional certifications often prioritize procedural knowledge over the theoretical abstraction and design skills emphasized in university courses.
Authors: We appreciate this observation and agree that greater detail on dataset construction is warranted. The 1,068 questions were drawn from publicly available practice exams and official preparation materials for six professional certifications (CompTIA Network+, Cisco CCNA, Microsoft Office Specialist, Oracle Java Programmer, and two additional domain-specific exams). Selection followed a stratified sampling approach to ensure balance across domains, languages, and difficulty. While these certifications emphasize applied skills, they map to core CS topics such as network protocols, software development, and productivity tools that appear in standard ACM/IEEE curricula. In the revised manuscript, we will expand the Methods section with explicit sourcing criteria, the full selection protocol, and a dedicated discussion of alignment (including both the practical relevance and the acknowledged limitations relative to theoretical design skills). This will better support the claim regarding LLM integration into CS education. revision: yes
- Referee: [Results (Bloom's Taxonomy analysis)] No inter-annotator agreement, expert validation, or reliability metrics are reported for the assignment of Bloom's Taxonomy levels to the questions. Without this, the key finding that models perform worse on higher-order tasks cannot be confidently distinguished from possible annotation artifacts.
Authors: We agree that reliability metrics are necessary to substantiate the Bloom's Taxonomy findings. The levels were assigned using the standard six-category framework, with two authors performing independent annotations followed by discussion to resolve disagreements. In the revised version, we will add a description of the annotation protocol in the Methods section and report inter-annotator agreement via Cohen's kappa. We will also note any limitations and, where possible, include a brief expert validation step to further mitigate concerns about annotation artifacts. revision: yes
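For reference, Cohen's kappa for two annotators can be computed as sketched below. The label values and the toy annotations are hypothetical; the rebuttal does not report the actual agreement figures.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is formally
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical annotations: "L" = lower-order, "H" = higher-order.
annotator_1 = ["L", "L", "H", "H", "L", "H", "L", "L"]
annotator_2 = ["L", "L", "H", "L", "L", "H", "H", "L"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 6 of 8 items (0.75) against a chance agreement of 34/64, which gives a kappa of 7/15.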
- Referee: [Results and Discussion] Performance comparisons (model rankings, language/domain differences, and the decline on complex tasks) are presented without statistical tests, confidence intervals, or error analysis. This weakens the ability to assess whether observed differences are robust or meaningful.
Authors: We concur that the lack of statistical support weakens the interpretability of the reported differences. The revised Results and Discussion sections will include 95% bootstrap confidence intervals for all accuracy figures, appropriate statistical tests (McNemar's test for paired model comparisons and chi-square tests for differences across languages, domains, and cognitive levels), and a dedicated error analysis subsection addressing potential sources of variance such as question phrasing and model-specific failure modes. These additions will allow readers to evaluate the robustness of the observed patterns. revision: yes
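Two of the promised analyses can be sketched with the standard library alone: a percentile bootstrap interval for accuracy and an exact (binomial) McNemar test for paired model comparisons. The rebuttal does not specify the bootstrap variant or the McNemar variant, so both choices here are assumptions, and the toy correctness flags are hypothetical. The chi-square tests are not covered.

```python
import math
import random

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def mcnemar_exact_p(model_a_correct, model_b_correct):
    """Two-sided exact McNemar p-value from paired correctness flags."""
    pairs = list(zip(model_a_correct, model_b_correct))
    only_a = sum(1 for a, b in pairs if a and not b)  # A right, B wrong
    only_b = sum(1 for a, b in pairs if b and not a)  # B right, A wrong
    discordant = only_a + only_b
    if discordant == 0:
        return 1.0
    # Binomial tail under the null that discordant pairs split 50/50.
    tail = sum(math.comb(discordant, k)
               for k in range(min(only_a, only_b) + 1)) / 2 ** discordant
    return min(1.0, 2 * tail)

# Hypothetical per-question correctness for two models on the same 10 items.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
ci_low, ci_high = bootstrap_accuracy_ci(model_a)
p_value = mcnemar_exact_p(model_a, model_b)
print(f"model A accuracy CI: [{ci_low:.2f}, {ci_high:.2f}], McNemar p = {p_value:.4f}")
```

The exact test is used instead of the chi-square approximation because discordant counts in per-domain or per-level subsets may be small.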
Circularity Check
No circularity: pure empirical benchmark on external certification data
full rationale
The paper reports direct performance measurements of four LLMs on a fixed, externally sourced dataset of 1,068 questions drawn from six professional certification exams. All reported results (accuracy by language, by Bloom's cognitive level, by domain, confidence calibration, and robustness to masking) are computed from model outputs against ground-truth answers without any fitted parameters, derived predictions, or self-referential equations. Bloom's Taxonomy is applied as an external classification scheme; no derivation or uniqueness claim reduces to the authors' own prior work. No load-bearing self-citations or ansatzes are present in the central claims. The evaluation is therefore self-contained against external benchmarks.
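The kind of direct measurement described above can be sketched as follows. The scoring rule (exact match on the full set of option letters, with no partial credit) and the record fields are assumptions, since the review does not state how multiple-choice answers were scored; the toy results are hypothetical.

```python
from collections import defaultdict

def normalize_letters(answer):
    """Reduce an answer such as 'b, a' to the canonical form 'AB'."""
    return "".join(sorted({ch for ch in answer.upper() if ch.isalpha()}))

def accuracy_by_group(results, key):
    """Accuracy per value of `key` (for example language or cognitive level)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row in results:
        group = row[key]
        totals[group] += 1
        # Assumption: exact match on the option set, no partial credit.
        hits[group] += int(normalize_letters(row["predicted"])
                           == normalize_letters(row["gold"]))
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical model outputs against ground-truth answers.
toy_results = [
    {"language": "en", "level": "lower", "predicted": "A", "gold": "A"},
    {"language": "en", "level": "higher", "predicted": "c, b", "gold": "BC"},
    {"language": "zh", "level": "lower", "predicted": "D", "gold": "D"},
    {"language": "zh", "level": "higher", "predicted": "A", "gold": "B"},
]
by_language = accuracy_by_group(toy_results, "language")
by_level = accuracy_by_group(toy_results, "level")
print(by_language, by_level)
```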
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Bloom's Taxonomy provides a valid and applicable categorization of cognitive levels for CS certification exam questions.
Reference graph
Works this paper leans on
-
[1]
Language Models (Mostly) Know What They Know
Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221. Kang, E., Kim, J., 2025. When language shapes thought: Cross-lingual transfer of factual knowledge in question answering, in: Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 4868–4873. Li, Y., Xin, J., Miao, M.M., Long, Q., Ung...
-
[2]
Large language models for education: A survey and outlook
Large language models for education: A survey and outlook. arXiv preprint arXiv:2403.18105. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al., 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837. Xiao, D., Gao, C., Luo, Z., ...
-
[3]
Code logic and output prediction
-
[4]
Syntax and performance issue identification
-
[5]
Core concept understanding (multithreading, collections, OOP)
-
[6]
Network Expert: You are a network expert answering choice-based questions
Complex scenario analysis (inheritance, exceptions, generics) Please carefully analyze the question and provide accurate answers. Network Expert: You are a network expert answering choice-based questions. Please focus on:
-
[7]
Network protocols and standards
-
[8]
Routing and switching technologies
-
[9]
Network security and troubleshooting
-
[10]
Network design and optimization Please provide accurate answers based on your expertise. Office Software Proficient Clerk: You are an office worker who is proficient in office software applications and possesses extensive knowledge in this field. You have many years of practical experience and are well-versed in various office...
-
[11]
Provide the Answer (Field name: "answerletter"): - Choose the correct option letter(s). - For {question_type}-choice questions, return {answer_format}. - Use uppercase letters. - Single-choice: return exactly ONE uppercase letter. - Multiple-choice: there are at least TWO correct options. Eval...
- [12]
-
[13]
confidence_answer_likert
Answer Confidence Rating (Field name: "confidence_answer_likert"): - Rate confidence (1-5): 1 = No confidence, 5 = Highly confident
-
[14]
Question Classification (Field name: "classification"): - Classify as "Lower-order" or "Higher-order"
-
[15]
confidence_classification_likert
Classification Confidence Rating (Field name: "confidence_classification_likert"): - Rate confidence in classification (1-5) Output Requirements:
-
[17]
NO Markdown, code blocks, or quotes
-
[18]
All text in {output_language}
-
[19]
Keep explanation under 100 words
-
[20]
Use bullet points for clarity Example JSON Format (Single-choice): { "answerletter": "A", "reasoning": "Key point 1; Key point 2", "confidence_answer_likert": "5", "classification": "Higher-order", "confidence_classification_likert": "5" } Example JS...
-
[21]
Translate all {source_language} content into {target_language}
-
[22]
Maintain the original format and structure
-
[23]
Keep column names unchanged ({column_names})
-
[24]
Ensure translations are accurate, natural, and conform to {target_language} expression conventions
-
[25]
Use standard technical terminology for programming terms
-
[26]
Keep code sections unchanged, only translate text descriptions
-
[27]
Ensure translations are rigorous, professional, and preserve the original meaning
-
[28]
Do not return any answers, explanations, or comments
-
[29]
Do not add any explanatory text in parentheses
-
[30]
Perform pure translation only, do not provide any additional information Note: {source_language} and {target_language} are set to "Chinese" or "English" based on automatic language detection. {column_names} is replaced with actual column names from the dataset. Appendix A.4. Difficulty Rating Prompt The following prompt was used to evaluate ...
-
[31]
Output ONLY a JSON object in message.content
-
[32]
difficulty_likert
Use exactly one field: "difficulty_likert" with a value in {"1","2","3","4","5"}
-
[33]
difficulty_likert
No markdown, no extra fields, no explanation Example: {"difficulty_likert":"3"} Your thinking process will be captured separately in reasoning_content. Note: The prompt enforces strict JSON output for automated parsing. Domain-specific expert roles from Section A.1 improve evaluation accuracy. Appendix B. Suppl...
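The strict output format quoted in the difficulty-rating prompt excerpt lends itself to a small validator on the receiving side. The sketch below is an assumption about how such replies might be checked, not the authors' parsing code; the function name is hypothetical, and only the field name and allowed values come from the excerpt.

```python
import json

ALLOWED_RATINGS = {"1", "2", "3", "4", "5"}

def parse_difficulty(raw):
    """Return the rating as an int, or None if the reply violates the format."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None  # Markdown fences or free text are not valid JSON
    # The prompt requires exactly one field named "difficulty_likert".
    if not isinstance(payload, dict) or set(payload) != {"difficulty_likert"}:
        return None
    value = payload["difficulty_likert"]
    if not isinstance(value, str) or value not in ALLOWED_RATINGS:
        return None
    return int(value)

print(parse_difficulty('{"difficulty_likert":"3"}'))
print(parse_difficulty('{"difficulty_likert":"3","note":"extra"}'))
```

Returning `None` instead of raising lets a caller count and report malformed replies separately from valid ratings.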