On Reasoning Behind Next Occupation Recommendation
Pith reviewed 2026-05-09 22:51 UTC · model grok-4.3
The pith
Fine-tuning LLMs on oracle career reasons matches fully supervised accuracy for next occupation prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Deriving high-quality oracle reasons via an LLM-as-a-Judge and fine-tuning LLMs on them for both reason generation and occupation prediction raises next occupation prediction accuracy to levels comparable with fully supervised methods and above unsupervised baselines; a single fine-tuned LLM outperforms two models trained separately on the two tasks; and prediction accuracy depends directly on the quality of the generated reasons.
What carries the argument
A two-step pipeline in which a reason generator produces a preference summary from a user's education and career history that then serves as input to an occupation predictor, with both components improved by fine-tuning on oracle reasons.
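The two-step flow can be sketched as follows. This is a minimal illustration, not the authors' implementation: `LLM`, `UserHistory`, the prompt wording, and the choice to pass the raw history to the predictor alongside the reason are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for any text-in, text-out model call.
LLM = Callable[[str], str]


@dataclass
class UserHistory:
    education: List[str]
    occupations: List[str]

    def as_text(self) -> str:
        return (
            f"Education: {'; '.join(self.education)}\n"
            f"Occupations: {'; '.join(self.occupations)}"
        )


def generate_reason(llm: LLM, history: UserHistory) -> str:
    """Step 1: summarize the user's preferences from their history."""
    prompt = "Summarize this user's career preferences.\n" + history.as_text()
    return llm(prompt).strip()


def predict_next_occupation(llm: LLM, history: UserHistory, reason: str) -> str:
    """Step 2: predict the next occupation, conditioned on the reason.

    Assumption: the history is included in the prompt as well as the reason.
    """
    prompt = (
        "Predict the user's next occupation.\n"
        + history.as_text()
        + f"\nReason: {reason}"
    )
    return llm(prompt).strip()


def recommend(llm: LLM, history: UserHistory) -> Tuple[str, str]:
    """Run both steps with one model, mirroring the single-LLM setting."""
    reason = generate_reason(llm, history)
    return reason, predict_next_occupation(llm, history, reason)
```

Passing the same `llm` to both steps corresponds to the single jointly fine-tuned model; passing two different callables would correspond to the separately fine-tuned variant.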
If this is right
- Next occupation prediction accuracy reaches levels comparable to fully supervised methods.
- A single LLM fine-tuned on both tasks outperforms two separately fine-tuned models.
- Prediction accuracy rises or falls with the measured quality of the generated reasons.
Where Pith is reading between the lines
- The same reason-generation step could be tested in other sequential recommendation settings such as course or skill suggestions.
- The method may lower the cost of building domain-specific recommenders by substituting LLM-generated reasons for human labels.
- Explicit preference summaries could make model outputs more inspectable and allow users to correct the reason before the final prediction.
Load-bearing premise
That the high-quality oracle reasons generated by the LLM judge accurately reflect the unobserved user preferences that actually drive occupation choices.
What would settle it
Run the fine-tuned model on a test set of user histories but replace the generated reasons with random or low-scoring text and check whether the accuracy gain over the base LLM disappears.
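One way to set up this check is sketched below. It is an illustration under stated assumptions: `predict` is a hypothetical callable wrapping the fine-tuned model, each example is a `(history, generated_reason, true_next_occupation)` triple, and shuffling reasons across users is used here as one stand-in for "random" text.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# (history_text, generated_reason, true_next_occupation); a hypothetical layout.
Example = Tuple[str, str, str]
# Hypothetical predictor: (history_text, reason) -> predicted occupation.
Predictor = Callable[[str, str], str]


def accuracy(
    predict: Predictor, examples: Sequence[Example], reasons: Sequence[str]
) -> float:
    """Top-1 accuracy when each example is paired with the given reason."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        predict(history, reason) == target
        for (history, _, target), reason in zip(examples, reasons)
    )
    return hits / len(examples)


def reason_ablation(
    predict: Predictor, examples: Sequence[Example], seed: int = 0
) -> Dict[str, float]:
    """Compare accuracy with generated, shuffled, and empty reasons."""
    generated: List[str] = [reason for _, reason, _ in examples]
    shuffled = generated[:]
    random.Random(seed).shuffle(shuffled)  # mismatch reasons across users
    return {
        "generated": accuracy(predict, examples, generated),
        "shuffled": accuracy(predict, examples, shuffled),
        "empty": accuracy(predict, examples, [""] * len(examples)),
    }
```

If the gain over the base LLM survives in the `shuffled` and `empty` conditions, the reasons are unlikely to be what carries the improvement.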
Original abstract
In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user using his/her past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs to improve their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence, and utility criteria, using an LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs' accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at https://github.com/Sarasarahhhhh/job_prediction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a two-step LLM-based framework for next-occupation recommendation. A reason generator first produces an 'oracle reason' summarizing a user's latent preferences from their education and career history; these reasons are scored by an LLM-as-a-Judge on factuality, coherence, and utility. The oracle reasons are then used to fine-tune smaller LLMs that jointly perform reason generation and occupation prediction. Experiments claim that the resulting models match fully supervised baselines, outperform unsupervised ones, that joint training beats separate models, and that prediction accuracy scales with reason quality. Public code is provided.
Significance. If the central empirical claims hold after validation, the work would demonstrate that synthetic reasoning traces can serve as an effective bridge between unsupervised pre-training and supervised task performance in a real-world sequential decision domain, reducing reliance on expensive human labels for career-path modeling. The joint-training result and the reported dependence of accuracy on reason quality are potentially reusable insights for other preference-inference tasks. The public code release strengthens reproducibility.
major comments (2)
- [Method (§3) and Experiments (§4)] The central claim that fine-tuning on LLM-as-a-Judge oracle reasons yields predictors 'comparable to fully supervised methods' (abstract and §4) rests on the untested premise that these synthetic reasons encode the unobserved user preferences driving career transitions. No human validation, inter-annotator agreement, or correlation between judge scores and downstream occupation accuracy is reported; without this, the accuracy gains could be explained by data augmentation or distillation rather than genuine reasoning.
- [Experiments (§4.3)] The single-LLM joint-training advantage (claim (b) in abstract) is presented as a key result, yet the paper does not report whether this advantage persists under distribution shift or when the judge LLM is replaced by a different model family. This is load-bearing for the practical recommendation to use a single jointly fine-tuned model.
minor comments (2)
- [Method (§3)] Notation for the reason generator and occupation predictor is introduced without a clear diagram or pseudocode; a figure showing the data flow from history → oracle reason → fine-tuning would improve clarity.
- [Experiments (§4)] The abstract states that 'the next occupation prediction accuracy depends on the quality of generated reasons' but does not quantify this dependence (e.g., via correlation or ablation tables); a dedicated table or plot would make the claim verifiable.
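The quantification requested in the last comment could take a form like the sketch below. It is illustrative only: the judge score scale, the per-example 0/1 correctness flags, and both function names are assumptions, and no figures from the paper are reproduced.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; with 0/1 ys this is the point-biserial coefficient."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def accuracy_by_score(
    scores: Sequence[float], correct: Sequence[int]
) -> Dict[float, float]:
    """Prediction accuracy within each judge-score bucket, sorted by score."""
    buckets: Dict[float, List[int]] = {}
    for score, hit in zip(scores, correct):
        buckets.setdefault(score, []).append(hit)
    return {s: sum(hits) / len(hits) for s, hits in sorted(buckets.items())}
```

A table built from `accuracy_by_score`, together with the coefficient from `pearson`, would let readers see whether accuracy rises monotonically with reason quality or only at the extremes.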
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and outline planned revisions.
- Referee: [Method (§3) and Experiments (§4)] The central claim that fine-tuning on LLM-as-a-Judge oracle reasons yields predictors 'comparable to fully supervised methods' (abstract and §4) rests on the untested premise that these synthetic reasons encode the unobserved user preferences driving career transitions. No human validation, inter-annotator agreement, or correlation between judge scores and downstream occupation accuracy is reported; without this, the accuracy gains could be explained by data augmentation or distillation rather than genuine reasoning.
  Authors: We agree that human validation and inter-annotator agreement would strengthen claims about the reasons capturing latent preferences. The manuscript does not include such validation. However, we do report that accuracy scales with reason quality (claim (c) and §4.3), which provides evidence that performance gains track the assessed content of the reasons rather than arising solely from augmentation or distillation effects. We will revise to explicitly discuss the lack of human evaluation as a limitation and expand the analysis of quality-accuracy dependence to address alternative explanations.
  Revision: partial
- Referee: [Experiments (§4.3)] The single-LLM joint-training advantage (claim (b) in abstract) is presented as a key result, yet the paper does not report whether this advantage persists under distribution shift or when the judge LLM is replaced by a different model family. This is load-bearing for the practical recommendation to use a single jointly fine-tuned model.
  Authors: Our experiments establish the joint-training advantage within the reported dataset and judge model. We did not evaluate robustness under distribution shift or alternative judge families. We will revise the manuscript to qualify the scope of this result and identify these checks as valuable future directions.
  Revision: partial
Circularity Check
No circularity in empirical fine-tuning pipeline for occupation prediction
The paper's results derive from an empirical pipeline: LLM-as-a-Judge generates oracle reasons scored on factuality/coherence/utility from user histories, these are used as training signals to fine-tune models for joint reason generation and next-occupation prediction, and accuracy is measured against held-out real next-occupation labels. No equations or derivations reduce to self-defined inputs by construction, no fitted parameters are renamed as predictions, and no self-citations provide load-bearing uniqueness theorems or ansatzes. The evaluation targets are external data independent of the generated reasons, making the approach a standard supervised fine-tuning setup with auxiliary synthetic labels rather than a closed loop.
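The oracle-reason selection step described above might look like the following sketch. It rests on assumptions not confirmed by the source: that several candidate reasons are scored per user, that the three criteria are combined by an unweighted mean, and that a threshold `min_overall` filters weak candidates. `Judge` is a hypothetical callable standing in for the LLM-as-a-Judge.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class JudgeScores:
    factuality: float
    coherence: float
    utility: float

    def overall(self) -> float:
        # Assumption: unweighted mean; the actual aggregation may differ.
        return (self.factuality + self.coherence + self.utility) / 3


# Hypothetical judge interface: (history_text, candidate_reason) -> scores.
Judge = Callable[[str, str], JudgeScores]


def select_oracle_reason(
    history: str,
    candidates: Sequence[str],
    judge: Judge,
    min_overall: float = 0.0,
) -> Optional[str]:
    """Return the highest-scoring candidate, or None if none clears the bar."""
    if not candidates:
        return None
    scored = [(judge(history, c).overall(), c) for c in candidates]
    best_score, best_reason = max(scored, key=lambda pair: pair[0])
    return best_reason if best_score >= min_overall else None
```

Because the selected reasons only serve as training targets and accuracy is measured against held-out next-occupation labels, a selection step of this kind does not by itself create a closed evaluation loop.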
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-as-a-Judge can reliably evaluate and generate high-quality reasons based on factuality, coherence, and utility
invented entities (1)
- oracle reasons: no independent evidence