Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning

Hongyang He; Jiuming Liu; Victor Sanchez

arxiv: 2607.01511 · v1 · pith:ORKL3OZNnew · submitted 2026-07-01 · 💻 cs.AI · cs.LG

Pith reviewed 2026-07-03 20:02 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords chain-of-thought · semi-supervised learning · pseudo-labeling · semantic entropy · self-training · reasoning · unlabeled data

The pith

Answer-level semantic entropy selects high-precision pseudo chain-of-thought demonstrations from unlabeled questions for semi-supervised training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines semi-supervised chain-of-thought learning and introduces the Semi-CoT framework to turn unlabeled questions into pseudo reasoning supervision. It samples multiple pseudo-CoTs per question, computes answer-level semantic entropy, and retains only low-entropy chains as reliable demonstrations. Experiments on AQuA, SVAMP, GSM8K, and MultiArith show the entropy gate produces pseudo-answers with 91.36% to 100% precision. The work extends self-training ideas from inference-time refinement to direct use of unlabeled data as training signals, though observed accuracy gains remain small and sometimes negative.

Core claim

Semi-CoT samples multiple pseudo-CoTs for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy reasoning chains as reliable pseudo-CoT demonstrations. The selected pseudo-answers reach 91.36% to 100% precision across AQuA, SVAMP, GSM8K, and MultiArith, which the paper takes as evidence that unlabeled questions can supply usable reasoning supervision under this filter.
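
For concreteness, answer-level semantic entropy can be written in the usual empirical form below. The abstract does not give the formula, so the notation is an assumption of this sketch: N is the number of sampled chains for a question q, a_i is the final answer of the i-th chain, C(q) is the set of answer clusters (answers judged equivalent), and τ is the selection threshold.

```latex
\hat{p}(c \mid q) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\,[a_i \in c],
\qquad
H(q) = - \sum_{c \in \mathcal{C}(q)} \hat{p}(c \mid q) \, \log \hat{p}(c \mid q),
\qquad
\text{keep } q \iff H(q) \le \tau .
```

Under this form H(q) is 0 when all N samples agree on one answer and reaches its maximum, log N, when every sample gives a different answer.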

What carries the argument

The entropy gate: sampling multiple pseudo-CoTs per unlabeled question and retaining only those with low answer-level semantic entropy as pseudo-supervision.
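
A minimal sketch of such a gate, assuming answers are clustered by simple string and numeric normalization. The helper `normalize_answer`, the function names, and the `threshold` parameter are hypothetical; the paper's actual clustering method, sample count, and threshold are not stated in the abstract.

```python
import math
from collections import Counter


def normalize_answer(raw):
    """Map a raw final answer to a comparable key (hypothetical helper)."""
    text = raw.strip().lower().replace(",", "").replace("$", "")
    try:
        return float(text)  # "18" and "18.0" fall into the same cluster
    except ValueError:
        return text  # non-numeric answers, e.g. option letters


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over answer clusters."""
    counts = Counter(normalize_answer(a) for a in answers)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def entropy_gate(samples, threshold):
    """Keep chains that agree with the majority answer when entropy is low.

    samples: list of (chain_text, final_answer) pairs for ONE unlabeled question.
    threshold: maximum entropy allowed (an assumed free parameter).
    Returns (pseudo_answer, kept_chains), or (None, []) if the question is rejected.
    """
    if not samples:
        return None, []
    answers = [answer for _, answer in samples]
    if answer_entropy(answers) > threshold:
        return None, []
    keys = [normalize_answer(a) for a in answers]
    majority = Counter(keys).most_common(1)[0][0]
    kept = [chain for (chain, _), key in zip(samples, keys) if key == majority]
    return majority, kept
```

Note that the gate only ever looks at final answers; the chain text is passed through untouched, which is exactly why the premise below carries so much weight.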

Load-bearing premise

Low answer-level semantic entropy on the final answer serves as a reliable proxy that the full reasoning chain is correct.

What would settle it

An audit that finds many low-entropy pseudo-CoTs containing incorrect intermediate steps despite correct final answers would falsify the gate's reliability.
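
One cheap first pass at such an audit, sketched under strong assumptions: it only checks explicit arithmetic statements of the form "expression = number" inside a chain, and it says nothing about whether the right quantities were combined. The function name `invalid_steps` and the regular expression are illustrative, not taken from the paper. Comma-formatted numbers, running equalities, and steps written in words are skipped rather than flagged.

```python
import ast
import operator
import re

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# "<arithmetic expression> = <number>", where the expression does not start mid-number.
_EQUATION = re.compile(r"(?<![\d.,])([0-9(][0-9\s.+\-*/()]*?)\s*=\s*(-?\d+(?:\.\d+)?)")


def _evaluate(node):
    """Evaluate a parsed arithmetic expression using only + - * / and numbers."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return float(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def invalid_steps(chain, tol=1e-6):
    """Return the arithmetic statements in `chain` whose two sides disagree."""
    bad = []
    for match in _EQUATION.finditer(chain):
        lhs, rhs = match.group(1).strip(), float(match.group(2))
        try:
            body = ast.parse(lhs, mode="eval").body
            if isinstance(body, ast.Constant):
                continue  # a bare number on the left carries no arithmetic
            value = _evaluate(body)
        except (SyntaxError, ValueError, ZeroDivisionError):
            continue  # unparseable text is skipped, not flagged
        if abs(value - rhs) > tol:
            bad.append(match.group(0).strip())
    return bad
```

Running this over the low-entropy chains and counting how many contain at least one flagged step would give a lower bound on the rate of invalid reasoning among accepted pseudo-CoTs.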

Original abstract

Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent reasoning capabilities in large language models. However, most existing CoT methods use reasoning chains mainly as inference-time prompts, while the generated reasoning traces are rarely reused as semi-supervised learning signals. In this report, we define Semi-supervised Chain-of-Thought Learning and propose Semi-CoT, a simple framework that uses unlabeled questions to construct pseudo reasoning supervision. Semi-CoT samples multiple pseudo-CoTs for each unlabeled question, estimates answer-level semantic entropy, and selects low-entropy reasoning chains as reliable pseudo-CoT demonstrations. This extends the self-training view of CoT from inference-time refinement to semi-supervised pseudo-supervision. Pilot experiments on AQuA, SVAMP, GSM8K, and MultiArith show that the entropy gate selects high-precision pseudo-CoTs, with pseudo-answer precision ranging from 91.36% to 100%. Semi-CoT also gives small gains on SVAMP and GSM8K, while AQuA shows negative transfer and MultiArith reaches a ceiling. These results suggest that unlabeled questions can provide reliable pseudo reasoning signals, but their effective use still requires stronger demonstration selection or student training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Semi-CoT applies answer-level entropy selection to pseudo CoTs and gets high precision on answers, but gains are small and reasoning steps remain unchecked.

The main takeaway is that this paper samples multiple CoT traces per unlabeled question, computes answer-level semantic entropy, and keeps the low-entropy subset as extra training data. That specific combination of sampling plus entropy gating on final answers is not in the earlier CoT self-training papers they cite.

The work is straightforward and the pilot numbers support the narrow claim: the selected pseudo-answers are correct between 91 and 100 percent of the time across the four datasets. Reporting the negative transfer on AQuA is also honest and useful.

The soft spots are clear. The selection criterion only checks agreement on the final answer, so it can accept chains that reach the right number through invalid steps; nothing in the method inspects the intermediate reasoning. Downstream results are modest at best (small lifts on SVAMP and GSM8K, a ceiling on MultiArith, negative transfer on AQuA), and the abstract gives no error bars, no statistical tests, and almost no detail on how many chains are drawn or how semantic entropy is actually calculated. Those gaps make the evidence preliminary.

This is for researchers already working on semi-supervised or self-training methods for LLM reasoning on math tasks. A reader who wants to test whether entropy can safely reduce labeled CoT data would get something concrete to try, but the paper does not claim or show a broad advance.

I would send it to peer review. The idea is simple enough to evaluate quickly and the reported precision numbers are worth a closer look even if the evaluation needs more rigor.

Referee Report

2 major / 2 minor

Summary. The paper defines Semi-supervised Chain-of-Thought Learning and proposes Semi-CoT, which samples multiple pseudo-CoTs per unlabeled question, thresholds on answer-level semantic entropy to select low-entropy chains as pseudo-supervision, and reports pilot results on AQuA, SVAMP, GSM8K and MultiArith showing 91.36–100% pseudo-answer precision together with small or mixed downstream gains.

Significance. If the selected chains supply verifiably high-quality reasoning traces rather than merely correct final answers, the framework would offer a practical route to semi-supervised CoT training. The current evidence remains preliminary and the significance is therefore limited until the reasoning-step quality assumption is directly tested.

major comments (2)

[method and pilot experiments] The selection procedure thresholds answer-level semantic entropy, which only certifies agreement on the final answer. No experiment or analysis checks whether the intermediate reasoning steps in the retained chains are valid; this assumption is load-bearing for the claim that the selected chains constitute reliable pseudo-CoT supervision (see the entropy-gate description and the pilot-experiment paragraph).
[pilot experiments] Downstream results are reported without error bars, statistical tests, or comparisons against standard self-training or CoT baselines; the small gains on SVAMP/GSM8K and negative transfer on AQuA therefore cannot be interpreted as evidence that the pseudo-CoT signal is effective.
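
As an illustration of the kind of uncertainty estimate the second comment asks for, a paired bootstrap over test questions is one standard option. This is a sketch, not something the paper reports; the function name, the default of 2000 resamples, and the 95% level are assumptions.

```python
import random


def paired_bootstrap_ci(base_correct, new_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the accuracy difference (new minus base).

    base_correct, new_correct: per-question 0/1 correctness on the SAME test
    questions, in the same order, for the baseline and the Semi-CoT variant.
    """
    if len(base_correct) != len(new_correct) or not base_correct:
        raise ValueError("inputs must be non-empty and aligned question by question")
    n = len(base_correct)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(new_correct[i] - base_correct[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the resulting interval for a dataset contains zero, the reported gain or loss on that dataset cannot be distinguished from resampling noise at the chosen level.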

minor comments (2)

[method] The exact procedure for computing semantic entropy (number of samples, clustering method, temperature) is not specified.
[abstract and experiments] The manuscript repeatedly refers to 'pilot experiments' yet presents the precision numbers as the headline result; clarify the scope and limitations of these runs.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our pilot study. We address each major comment below and indicate planned revisions.

Referee: [method and pilot experiments] The selection procedure thresholds answer-level semantic entropy, which only certifies agreement on the final answer. No experiment or analysis checks whether the intermediate reasoning steps in the retained chains are valid; this assumption is load-bearing for the claim that the selected chains constitute reliable pseudo-CoT supervision (see the entropy-gate description and the pilot-experiment paragraph).

Authors: We agree that answer-level semantic entropy certifies final-answer agreement rather than step-by-step validity. Our pilot reports high pseudo-answer precision as evidence of selection quality, but does not include direct checks on reasoning-step correctness. We will revise the manuscript to explicitly state this assumption as a limitation and clarify that pseudo-CoT reliability is inferred from answer consistency.

revision: yes

Referee: [pilot experiments] Downstream results are reported without error bars, statistical tests, or comparisons against standard self-training or CoT baselines; the small gains on SVAMP/GSM8K and negative transfer on AQuA therefore cannot be interpreted as evidence that the pseudo-CoT signal is effective.

Authors: The downstream numbers are from a small-scale pilot and lack error bars or statistical tests, limiting interpretability of the mixed gains. We will revise the text to emphasize the preliminary character of these results, focus primary claims on the observed pseudo-answer precision, and note that rigorous baseline comparisons are left for future work.

revision: partial

Circularity Check

0 steps flagged

No circularity; entropy selection and precision measurement are independent

The paper defines Semi-CoT by sampling multiple CoTs per unlabeled question, computing answer-level semantic entropy from those samples, and thresholding to select low-entropy chains. Precision is then measured by comparing the selected pseudo-answers to ground-truth labels on the evaluation sets. Because the entropy computation uses only model samples and the precision metric uses external labels never seen during selection, no equation or procedure reduces the reported 91.36–100% figures to a quantity fitted on the same data. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The result is therefore a self-contained empirical observation rather than a tautology.
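
The separation described here can be made concrete with a short sketch. The function name and the dictionary layout are assumptions for illustration; the point is only that gold labels enter after selection is finished.

```python
def pseudo_answer_precision(selected, gold):
    """Fraction of gated pseudo-answers that match held-out gold labels.

    selected: dict mapping question id -> pseudo-answer, containing only the
              questions that passed the entropy gate (built without labels).
    gold:     dict mapping question id -> ground-truth answer, consulted only here.
    Returns (precision, number_selected); precision is None if nothing was selected.
    """
    if not selected:
        return None, 0
    hits = sum(1 for qid, answer in selected.items() if gold.get(qid) == answer)
    return hits / len(selected), len(selected)
```

Reporting the number of selected questions alongside precision also shows how much of the unlabeled pool the gate actually lets through, a figure the abstract does not state.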

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested assumption that semantic entropy over final answers correlates with reasoning-chain quality; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Low answer-level semantic entropy indicates high-quality reasoning chains suitable for use as pseudo-supervision.
The method selects chains solely on this basis; if the correlation does not hold, the pseudo-labels are unreliable.

pith-pipeline@v0.9.1-grok · 5760 in / 1361 out tokens · 26954 ms · 2026-07-03T20:02:01.869720+00:00 · methodology


[1] [1]

Self- training: A survey.Neurocomputing, 616:128904, 2025

Massih-Reza Amini, Vasilii Feofanov, Loic Pauletto, Lies Hadjadj, Emilie Devijver, and Yury Maximov. Self- training: A survey.Neurocomputing, 616:128904, 2025

2025

[2] [2]

Boosting the margin: A new explanation for the effectiveness of voting methods.The annals of statistics, 26(5):1651–1686, 1998

Peter Bartlett, Yoav Freund, Wee Sun Lee, and Robert E Schapire. Boosting the margin: A new explanation for the effectiveness of voting methods.The annals of statistics, 26(5):1651–1686, 1998

1998

[3] [3]

Debiased self-training for semi-supervised learning.Advances in Neural Information Processing Systems, 35:32424–32437, 2022

Baixu Chen, Junguang Jiang, Ximei Wang, Pengfei Wan, Jianmin Wang, and Mingsheng Long. Debiased self-training for semi-supervised learning.Advances in Neural Information Processing Systems, 35:32424–32437, 2022

2022

[4] [4]

Contrastive chain-of-thought prompting.arXiv preprint arXiv:2311.09277, 2023

Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, and Lidong Bing. Contrastive chain-of-thought prompting.arXiv preprint arXiv:2311.09277, 2023

work page arXiv 2023

[5] [5]

Navigate through enigmatic labyrinth a survey of chain of thought reasoning: Advances, frontiers and future

Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. Navigate through enigmatic labyrinth a survey of chain of thought reasoning: Advances, frontiers and future. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11...

2024

[6] [6]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

2024

[8] [8]

Self-training converts weak learners to strong learners in mixture models

Spencer Frei, Difan Zou, Zixiang Chen, and Quanquan Gu. Self-training converts weak learners to strong learners in mixture models. InInternational Conference on Artificial Intelligence and Statistics, pages 8003–8021. PMLR, 2022

2022

[9] [9]

Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies.Transactions of the Association for Computational Linguistics, 9:346–361, 2021

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies.Transactions of the Association for Computational Linguistics, 9:346–361, 2021

2021

[10] [10]

Semi-supervised learning by entropy minimization.Advances in neural information processing systems, 17, 2004

Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization.Advances in neural information processing systems, 17, 2004

2004

[11] [11]

Trustmatch: mitigating pseudo-label bias in semi-supervised learning with trust-aware refinement

Hongyang He and Yundi Hong. Trustmatch: mitigating pseudo-label bias in semi-supervised learning with trust-aware refinement. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 594–603, 2025

2025

[12] [12]

Trico: Triadic game-theoretic co-training for robust semi-supervised learning, 2025

Hongyang He, Xinyuan Song, Yangfan He, Zeyu Zhang, Yanshu Li, Haochen You, Lifan Sun, and Wenqiao Zhang. Trico: Triadic game-theoretic co-training for robust semi-supervised learning, 2025

2025

[13] [13]

4s-classifier: Empowering conservation through semi-supervised learning for rare and endangered species

Hongyang He, Hongyang Xie, Guodong Shen, Boyang Fu, Haochen You, and Victor Sanchez. 4s-classifier: Empowering conservation through semi-supervised learning for rare and endangered species. In2025 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2025

2025

[14] [14]

Semi-vim: Bidirectional state space model for mitigating label imbalance in semi-supervised learning

Hongyang He, Hongyang Xie, Haochen You, and Victor Sanchez. Semi-vim: Bidirectional state space model for mitigating label imbalance in semi-supervised learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 765–774, 2025. 15

2025

[15] [15]

Token-aware representation augmentation for fine-grained semi-supervised learning

Hongyang He, Yan Zhong, Xinyuan Song, Daizong Liu, and Victor Sanchez. Token-aware representation augmentation for fine-grained semi-supervised learning. InThe Third Conference on Parsimony and Learning (Proceedings Track), 2026

2026

[16] [16]

Newton-coupled dual-teacher semi-supervised learning framework

Hongyang He, Yan Zhong, Xinyuan Song, Daizong Lui, Xuanyu Liu, and Victor Sanchez Silva. Newton-coupled dual-teacher semi-supervised learning framework. 2026

2026

[17]

Yundi Hong, Hongyang He, Yanbin Li, Ao Li, and Victor Sanchez Silva. Partmatch: part-aware pseudo-labeling for fine-grained semi-supervised learning. In IEEE International Conference on Multimedia and Expo 2026. IEEE, 2026

[18]

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 523–533, 2014

[19]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

[20]

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597, 2015

[21]

Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013

[22]

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 158–167, 2017

[23]

Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning. Advances in Neural Information Processing Systems, 36:36407–36433, 2023

[24]

Jingyu Liu, Jiaen Lin, and Yong Liu. How much can rag help the reasoning of llm? arXiv preprint arXiv:2410.02338, 2024

[25]

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018

[26]

Subhabrata Mukherjee and Ahmed Awadallah. Uncertainty-aware self-training for few-shot text classification. Advances in Neural Information Processing Systems, 33:21199–21212, 2020

[27]

Sania Nayab, Giulio Rossolini, Marco Simoni, Andrea Saracino, Giorgio Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli. Concise thoughts: Impact of output length on llm reasoning and cost. arXiv preprint arXiv:2407.19825, 2024

[28]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 2080–2094, 2021

[29]

Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 1743–1752, 2015

[30]

H Scudder. Adaptive communication receivers. IEEE Transactions on Information Theory, 11(2):167–174, 1965

[31]

Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020

[32]

Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, et al. A survey of reasoning with foundation models: Concepts, methodologies, and outlook. ACM Computing Surveys, 57(11):1–43, 2025

[33]

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019

[34]

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017

[35]

Fang Wan, Chen Sun, Hongyang He, Guangbo Lei, Li Xu, and Teng Xiao. Yolo-lrdd: A lightweight method for road damage detection based on improved yolov5s. EURASIP Journal on Advances in Signal Processing, 2022(1):98, 2022

[36]

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 2609–2634, 2023

[37]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

[38]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[39]

Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, and Lei Feng. Rethinking chain-of-thought from the perspective of self-training. arXiv preprint arXiv:2412.10827, 2024

[40]

Hongyang Xie, Hongyang He, Boyang Fu, and Victor Sanchez. Grdt: Towards robust deepfake detection using geometric representation distribution and texture. In Proceedings of the Winter Conference on Applications of Computer Vision, pages 734–744, 2025

[41]

Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong Long, Jian-Guang Lou, and Shuai Ma. Re-reading improves reasoning in large language models. In Proceedings of the 2024 conference on empirical methods in natural language processing, pages 15549–15575, 2024

[42]

Xiangli Yang, Zixing Song, Irwin King, and Zenglin Xu. A survey on deep semi-supervised learning. IEEE transactions on knowledge and data engineering, 35(9):8934–8954, 2022

[43]

Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in neural information processing systems, 34:18408–18419, 2021

[44]

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

[45]

Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Zeyu Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, et al. Evaluation of openai o1: Opportunities and challenges of agi. arXiv preprint arXiv:2409.18486, 2024

[46]

Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5982–5991, 2019