pith. machine review for the scientific record.

arxiv: 2605.12748 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI · cs.CY · cs.LG

Recognition: no theorem link

Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

Pith reviewed 2026-05-14 20:42 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY · cs.LG
keywords LLM simulators · misconception faithfulness · selective flip score · sycophantic behavior · student modeling · AI tutors · belief updating · feedback protocols

The pith

LLM student simulators correct answers at similar rates whether feedback targets the actual misconception or not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a controlled test to check whether AI models acting as simulated students maintain a consistent misconception across interactions. It contrasts targeted feedback that addresses the underlying error against misaligned feedback about a different error and generic feedback that only says the answer is wrong. The central result is that seven different LLMs produce near-zero Selective Flip Scores, flipping their answers at roughly the same high rate in all three conditions. This pattern indicates the models behave like sycophantic problem-solvers that treat any corrective signal as a reason to abandon the simulated response and recompute from their own knowledge. The authors further show that a training pipeline combining supervised fine-tuning, preference optimization, and reinforcement learning with an SFS-aligned reward can raise the score substantially.

Core claim

Across seven LLMs from 4B to 120B parameters, multiple datasets, and prompting strategies, simulators exhibit near-zero SFS: they correct their answers at similarly high rates under targeted misconception feedback as under misaligned or generic feedback. This reveals a sycophantic failure mode in which models do not maintain a misconception-driven belief state but instead treat any corrective signal as a cue to re-solve the problem from internal knowledge.

What carries the argument

Selective Flip Score (SFS), a metric that measures how much more often a simulator flips its answer under targeted feedback than under misaligned or generic controls.
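The exact formula for SFS is not reproduced on this page; under the plausible reading that SFS is the targeted-feedback flip rate minus the average flip rate of the two controls, the metric can be sketched as:

```python
# Sketch of an SFS-style computation. The paper's exact formula is not
# reproduced in this review, so we ASSUME:
#   SFS = flip rate under targeted feedback
#         - mean flip rate under the misaligned and generic controls.

def flip_rate(outcomes):
    """Fraction of trials (0/1 indicators) where the simulator flipped its answer."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def selective_flip_score(targeted, misaligned, generic):
    """Each argument is a list of 0/1 flip indicators over the same problems."""
    control = 0.5 * (flip_rate(misaligned) + flip_rate(generic))
    return flip_rate(targeted) - control

# A sycophantic simulator flips almost everywhere, so SFS lands near zero:
sycophantic = selective_flip_score([1] * 9 + [0], [1] * 9 + [0], [1] * 8 + [0, 0])
# A faithful simulator flips mainly when feedback targets its misconception:
faithful = selective_flip_score([1] * 8 + [0, 0], [1] + [0] * 9, [1] + [0] * 9)
```

Under this assumed form, a simulator that flips at roughly the same high rate in all three conditions scores near zero regardless of how often it flips, which is what makes SFS a control for sycophancy rather than a raw correction rate.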

If this is right

  • Current LLM simulators cannot be trusted to train AI tutors because they do not replicate the selective belief-updating behavior of real students.
  • Models treat corrective signals as generic cues to abandon the simulated answer and recompute from internal knowledge.
  • A post-training pipeline of supervised fine-tuning followed by SFS-aligned reinforcement learning raises SFS, with SFT yielding gains up to +0.56 and RL providing more consistent improvement than preference optimization alone.
  • Misconception faithfulness is a trainable property rather than an inherent limit of current model sizes.
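The pipeline bullet above can be made concrete with a hypothetical sketch of an SFS-aligned reward: flipping is rewarded only when the feedback actually targets the misconception. This is an illustrative reading, not the paper's actual reward function.

```python
# Hypothetical sketch of an "SFS-aligned" RL reward consistent with the
# pipeline described above: reward answer revision under targeted feedback
# and persistence under misaligned or generic feedback. Illustrative only;
# the paper's reward is not reproduced in this review.

def sfs_aligned_reward(condition: str, flipped: bool) -> float:
    """condition: 'targeted', 'misaligned', or 'generic'."""
    if condition == "targeted":
        # A faithful student revises when the real misconception is addressed.
        return 1.0 if flipped else -1.0
    # Under irrelevant feedback, a faithful student holds the misconception.
    return -1.0 if flipped else 1.0
```

A policy maximizing this reward is pushed toward exactly the contrastive behavior SFS measures, which is why such a reward could raise SFS where imitation-only SFT plateaus.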

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If simulators lack persistent internal belief states, multi-turn tutoring interactions may collapse into generic correction loops rather than genuine misconception diagnosis.
  • The same sycophantic pattern could appear in other role-play settings where the model is asked to maintain a fixed persona or knowledge state across turns.
  • Extending the contrastive-feedback protocol to measure belief persistence over longer dialogues would test whether the observed failure is limited to single-turn corrections.

Load-bearing premise

The misconception-contrastive feedback protocol cleanly isolates whether a simulator maintains a misconception-driven belief state rather than other response patterns.

What would settle it

An LLM that produces a substantially positive SFS by flipping answers far more often under targeted feedback than under either misaligned or generic feedback would directly contradict the near-zero result.

Figures

Figures reproduced from arXiv: 2605.12748 by Heejin Do, Mrinmaya Sachan, Shashank Sonkar.

Figure 1
Figure 1. Diagnostic framework for misconception faithfulness via misconception-contrastive feedback. Given a problem q with incorrect answer aw arising from misconception m, we evaluate simulator behavior under three feedback conditions: targeted feedback fT addressing m, misaligned feedback fM targeting a different plausible misconception m′ ≠ m, and generic feedback fG indicating only that aw is incorrect. Top: … view at source ↗
Figure 2
Figure 2. Misconception-faithful student simulator optimization pipeline. To induce contrastive separation between aligned and misaligned updates, we apply DPO [23] on top of the SFT model. For each training instance, we construct preference pairs (y+, y−), where y+ is a judge-verified response satisfying C∗(f), and y− is sampled from the SFT model’s outputs that fall outside C∗(f) (on-policy hard negatives).… view at source ↗
Figure 3
Figure 3. (a) Decomposition into content and specificity effects, (b) Relationship between model … view at source ↗
Figure 4
Figure 4. Multi-turn simulation results. Left: First-turn reasoning quality (averaged coherence and alignment with the target misconception). Right: SFS in multi-turn vs. single-turn settings. Multi-turn reflection does not prevent sensitivity collapse. One possible explanation is that single-turn prompting is not sufficient to induce commitment to the simulated misconception. To test this, we introduce a multi-t… view at source ↗
Figure 5
Figure 5. Post-training results on the Malrule (top) and EEDI (bottom) datasets. view at source ↗
Figure 6
Figure 6. In-domain versus out-of-domain SFS results, training on Malrule (left) or EEDI (right). We further evaluate whether misconception-faithful behavior transfers across datasets by training on one dataset and evaluating on the other. view at source ↗
Figure 7
Figure 7. Behavioral pattern distribution. To better understand how training reshapes simulator behavior, we analyze the response-category distributions of the Qwen3-4B simulators (baseline, SFT) across feedback types (… view at source ↗
Figure 8
Figure 8. Behavioral response distributions across feedback conditions for single-turn and multi… view at source ↗
Original abstract

Large language models (LLMs) can fluently generate student-like responses, making them attractive as simulated students for training and evaluating AI tutors and human educators. Yet such simulators are typically evaluated by output similarity to real students, not by whether they behave like students with coherent misconceptions during interaction. We introduce a controlled framework for evaluating misconception faithfulness, whether a simulator maintains a misconception-driven belief state and updates selectively when feedback addresses the underlying misconception. Central to our framework is a misconception-contrastive feedback protocol that compares targeted feedback against two controls: misaligned feedback (targeting a different but plausible misconception) and generic feedback (only identifying answer is wrong). We propose Selective Flip Score (SFS), which quantifies how much more often a simulator flips its answer under targeted feedback than under contrastive controls. Across seven LLMs (4B-120B), multiple datasets, and prompting strategies, simulators exhibit near-zero SFS, correcting their answers at similarly high rates regardless of feedback relevance. Further analyses reveal a sycophantic failure mode: models behave less like students with misconceptions but more like problem-solvers who treat any corrective signal as a cue to abandon the simulated belief and re-solve from internal knowledge. To address this, we develop a post-training pipeline spanning supervised fine-tuning (SFT), preference optimization, and reinforcement learning (RL) with an SFS-aligned reward; SFT yields notable gains up to +0.56, and SFS-aligned RL provides more consistent improvements than preference optimization. Our results establish misconception faithfulness as a challenging yet trainable property, motivating a shift from static output matching toward interactive, belief-aware student modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a misconception-contrastive feedback protocol (targeted vs. misaligned vs. generic) and Selective Flip Score (SFS) to measure whether LLM student simulators maintain coherent misconception-driven belief states. It reports near-zero SFS across seven LLMs (4B-120B), multiple datasets, and prompting strategies, with models correcting answers at similarly high rates regardless of feedback relevance; this is interpreted as sycophantic problem-solving. The authors further present a post-training pipeline (SFT, preference optimization, and RL with SFS-aligned reward) that yields SFS gains up to +0.56.

Significance. If the central empirical finding holds, the work identifies a fundamental limitation in current LLM simulators for educational applications and demonstrates that misconception faithfulness is a trainable property rather than an inherent flaw. The controlled protocol, the SFS metric, and the reproducible post-training pipeline (particularly the SFS-aligned RL component) provide concrete tools for future research, shifting evaluation from static output similarity to interactive belief maintenance.

major comments (2)
  1. [§4 (Results and Analysis)] The central claim of near-zero SFS and indistinguishable correction rates across feedback conditions requires statistical support (e.g., paired significance tests or confidence intervals on flip-rate differences) to establish that the observed similarity is not due to sampling variability; without these, the interpretation as sycophancy remains suggestive rather than definitive.
  2. [§3 (Misconception-Contrastive Feedback Protocol)] The protocol's validity as an isolator of belief-state maintenance hinges on misaligned feedback never inadvertently addressing the original misconception; the manuscript should include explicit construction rules and validation examples per dataset to rule out leakage that could artifactually produce low SFS.
minor comments (2)
  1. [Abstract and §3] The abstract and methods should explicitly list the datasets and prompting strategies used, as the claim of consistency 'across multiple datasets' cannot be evaluated without this information.
  2. [§4] Figure captions and result tables should report exact sample sizes per condition and model to allow readers to assess the reliability of the near-zero SFS values.
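The statistical support the first major comment asks for could take the form of an exact McNemar test on paired flip outcomes (same items, targeted vs. control feedback). A self-contained sketch, with hypothetical 0/1 flip indicators:

```python
# Exact (binomial) McNemar test on paired flip outcomes, implemented from
# scratch so the sketch stays self-contained. `targeted` and `control` are
# hypothetical 0/1 flip indicators over the same problem set.
from math import comb

def mcnemar_exact(targeted, control):
    """Two-sided exact McNemar p-value on paired binary outcomes."""
    b = sum(1 for t, c in zip(targeted, control) if t == 1 and c == 0)
    c_ = sum(1 for t, c in zip(targeted, control) if t == 0 and c == 1)
    n = b + c_
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c_)
    # double the lower binomial tail at p = 0.5, capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

Near-zero SFS predicts few discordant pairs and thus a large p-value; a faithful simulator, flipping far more often under targeted feedback, would produce many targeted-only flips and a small one.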

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the statistical rigor and protocol transparency of our work. We address each major comment below.

Point-by-point responses
  1. Referee: The central claim of near-zero SFS and indistinguishable correction rates across feedback conditions requires statistical support (e.g., paired significance tests or confidence intervals on flip-rate differences) to establish that the observed similarity is not due to sampling variability; without these, the interpretation as sycophancy remains suggestive rather than definitive.

    Authors: We agree that adding statistical support will make the claims more definitive. In the revised §4, we will include paired significance tests (McNemar's test for binary flip outcomes and paired t-tests on per-model flip-rate differences) together with 95% confidence intervals on the targeted-minus-control differences. These tests will be applied across all seven models, datasets, and prompting strategies to confirm that the near-zero SFS values are not explained by sampling variability. revision: yes

  2. Referee: The protocol's validity as an isolator of belief-state maintenance hinges on misaligned feedback never inadvertently addressing the original misconception; the manuscript should include explicit construction rules and validation examples per dataset to rule out leakage that could artifactually produce low SFS.

    Authors: We will add a dedicated subsection in §3 detailing the construction rules for misaligned feedback (selecting a distinct misconception from the same dataset's misconception inventory that shares no lexical or conceptual overlap with the target misconception) and will include one validation example per dataset showing the original misconception, the misaligned feedback, and a brief human verification that the misaligned feedback does not resolve the original misconception. revision: yes
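The leakage check the authors describe could be approximated by a simple lexical filter over misconception descriptions. The stopword list and overlap rule below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of the leakage check described in the rebuttal: a
# candidate misaligned misconception m' is accepted only if it shares no
# content-word overlap with the target misconception m. The stopword list
# and overlap rule are illustrative, not the authors' implementation.

STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "that", "when", "by", "in"}

def content_words(text):
    """Lowercased tokens minus trailing punctuation and stopwords."""
    return {w.strip(".,;:").lower() for w in text.split()} - STOPWORDS

def is_valid_misaligned(target_m, candidate_m):
    """Reject candidates that lexically overlap with the target misconception."""
    return not (content_words(target_m) & content_words(candidate_m))
```

A lexical filter like this catches only surface overlap, so human verification of conceptual distinctness, as the authors propose, would still be needed on top of it.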

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The evaluation framework rests on the domain assumption that LLMs can be prompted to simulate students holding coherent misconceptions and that answer changes under feedback reflect belief-state updates.

axioms (1)
  • domain assumption LLMs prompted as students maintain a misconception-driven belief state that can be selectively updated by relevant feedback
    This is the core premise tested by the misconception-contrastive feedback protocol.
invented entities (2)
  • Selective Flip Score (SFS) no independent evidence
    purpose: Quantify selective answer flipping under targeted versus control feedback
    Newly defined metric to measure misconception faithfulness.
  • misconception-contrastive feedback protocol no independent evidence
    purpose: Compare targeted feedback against misaligned and generic controls
    New protocol introduced to isolate belief-state maintenance.

pith-pipeline@v0.9.0 · 5618 in / 1328 out tokens · 51572 ms · 2026-05-14T20:42:17.903145+00:00 · methodology


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 6 internal anchors

  1. [1] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

  2. [2] John R Anderson, Albert T Corbett, Kenneth R Koedinger, and Ray Pelletier. Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2):167–207, 1995.

  3. [3] Merlyn J Behr, Ipke Wachsmuth, Thomas R Post, and Richard Lesh. Order and equivalence of rational numbers: A clinical teaching experiment. Journal for Research in Mathematics Education, 15(5):323–341, 1984.

  4. [4] John Seely Brown and Richard R Burton. Diagnostic models for procedural bugs in basic mathematical skills. Cognitive Science, 2(2):155–192, 1978.

  5. [5] Chen Chen, Gerhard Sonnert, Philip M Sadler, and Susan Sunbury. The impact of high school life science teachers’ subject matter knowledge and knowledge of student misconceptions on students’ learning. CBE—Life Sciences Education, 19(1):ar9, 2020.

  6. [6] Xinghe Chen, Naiming Liu, and Shashank Sonkar. Malrulelib: Large-scale executable misconception reasoning with step traces for modeling student thinking in mathematics. arXiv preprint arXiv:2601.03217, 2026.

  7. [7] J Al Easley and Russell E Zwoyer. Teaching by listening-toward a new day in math classes. Contemporary Education, 47(1):19, 1975.

  8. [8] Nigel Fernandez, Alexander Scarlatos, Wanyong Feng, Simon Woodhead, and Andrew Lan. DiVERT: Distractor generation with variational errors represented as text for math multiple-choice questions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9063–9081, 2024.

  9. [9] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: Large language models can self-correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Sx038qxjek.

  10. [10] Arthur C Graesser, Sidney D’Mello, Xiangen Hu, Zhiqiang Cai, Andrew Olney, and Brent Morgan. AutoTutor. In Applied Natural Language Processing: Identification, Investigation and Resolution, pages 169–187. IGI Global Scientific Publishing, 2012.

  11. [11] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  12. [12] Tanja Käser and Giora Alexandron. Simulated learners in educational technology: A systematic literature review and a Turing-like test. International Journal of Artificial Intelligence in Education, 34(2):545–585, 2024.

  13. [13] Jules King, L Burleigh, Simon Woodhead, Panagiota Kon, Perpetual Baffour, Scott Crossley, Walter Reade, and Maggie Demkin. Eedi - mining misconceptions in mathematics. https://kaggle.com/competitions/eedi-mining-misconceptions-in-mathematics, 2024. Kaggle.

  14. [14] Naiming Liu, Zichao Wang, Richard Baraniuk, and Andrew Lan. Open-ended knowledge tracing for computer science education. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.

  15. [15] Naiming Liu, Shashank Sonkar, Zichao Wang, Simon Woodhead, and Richard G Baraniuk. Novice learner and expert tutor: Evaluating math reasoning abilities of large language models with misconceptions. arXiv preprint arXiv:2310.02439, 2023.

  16. [16] Jakub Macina, Nico Daheim, Sankalan Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. MathDial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5602–5621, Singapore, 2023. Association for Computational Linguistics.

  17. [17] Julia M Markel, Steven G Opferman, James A Landay, and Chris Piech. GPTeach: Interactive TA training with GPT-based students. In Proceedings of the Tenth ACM Conference on Learning @ Scale, pages 226–236, 2023.

  18. [18] Marilyn Matz. Towards a computational theory of algebraic competence. The Journal of Mathematical Behavior, 3(1):93–166, 1980.

  19. [19] Yujing Ni and Yong-Di Zhou. Teaching and learning fraction and rational numbers: The origins and implications of whole number bias. Educational Psychologist, 40(1):27–52, 2005.

  20. [20] Sitong Pan, Robin Schmucker, Bernardo Garcia Bulle Bueno, Salome Aguilar Llanes, Fernanda Albo Alarcón, Hangxiao Zhu, Adam Teo, and Meng Xia. TutorUp: What if your students were simulated? Training tutors to address engagement challenges in online learning. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2025.

  21. [21] Nisarg Parikh, Nigel Fernandez, Alexander Scarlatos, Simon Woodhead, and Andrew Lan. Lookalike: Consistent distractor generation in math MCQs. arXiv preprint arXiv:2505.01903, 2025.

  22. [22] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. Advances in Neural Information Processing Systems, 28, 2015.

  23. [23] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

  24. [24] Lauren B Resnick, Pearla Nesher, François Leonard, Maria Magone, Susan Omanson, and Irit Peled. Conceptual bases of arithmetic errors: The case of decimal fractions. Journal for Research in Mathematics Education, 20(1):8–27, 1989.

  25. [25] Alexis Ross and Jacob Andreas. Learning to make mistakes: Modeling incorrect student thinking and key errors. arXiv preprint arXiv:2510.11502, 2025.

  26. [26] Sangwon Ryu, Heejin Do, Daehui Kim, Hwanjo Yu, Dongwoo Kim, Yunsu Kim, Gary Lee, and Jungseul Ok. Exploring iterative controllable summarization with large language models. In Findings of the Association for Computational Linguistics: EACL 2026, pages 512–528, Rabat, Morocco, March 2026. Association for Computational Linguistics.

  27. [27] Alexander Scarlatos, Jaewook Lee, Simon Woodhead, and Andrew Lan. Simulated students in tutoring dialogues: Substance or illusion? arXiv preprint arXiv:2601.04025, 2026.

  28. [28] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  29. [29] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  30. [30] Robert S Siegler, Greg J Duncan, Pamela E Davis-Kean, Kathryn Duckworth, Amy Claessens, Mimi Engel, Maria Ines Susperreguy, and Meichu Chen. Early predictors of high school mathematics achievement. Psychological Science, 23(7):691–697, 2012.

  31. [31] Shashank Sonkar, Andrew E Waters, Andrew S Lan, Phillip J Grimaldi, and Richard G Baraniuk. qDKT: Question-centric deep knowledge tracing. arXiv preprint arXiv:2005.12442, 2020.

  32. [32] Shashank Sonkar, Xinghe Chen, Naiming Liu, Richard G Baraniuk, and Mrinmaya Sachan. LLM-based cognitive models of students with misconceptions. arXiv preprint arXiv:2410.12294, 2024.

  33. [33] Shichao Sun, Ruifeng Yuan, Ziqiang Cao, Wenjie Li, and Pengfei Liu. Prompt chaining or stepwise prompt? Refinement in text summarization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7551–7558, 2024.

  34. [34] Songlin Xu and Xinyu Zhang. Leveraging generative artificial intelligence to simulate student learning behavior. arXiv preprint arXiv:2310.19206, 2023.

  35. [35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  36. [36] Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar, and Mrinmaya Sachan. Can LLMs model incorrect student reasoning? A case study on distractor generation. arXiv preprint arXiv:2603.15547, 2026.