
arXiv: 2604.21204 · v1 · submitted 2026-04-23 · 💻 cs.CL · cs.AI · cs.IR

On Reasoning Behind Next Occupation Recommendation

Pith reviewed 2026-05-09 22:51 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords LLM fine-tuning · occupation prediction · reason generation · career recommendation · user preference summary · LLM-as-a-Judge · sequential recommendation

The pith

Fine-tuning LLMs on oracle career reasons brings next occupation prediction accuracy to a level comparable with fully supervised methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs can be improved for recommending a user's next occupation by first generating a short reason that summarizes preferences from their education and past jobs, then feeding that reason into a predictor. High-quality oracle reasons are created with an LLM judge that scores them on factuality, coherence, and utility, after which smaller LLMs are fine-tuned to produce both the reasons and the final predictions. Experiments indicate the resulting accuracy reaches the level of fully supervised models and exceeds unsupervised ones, while a single model handling both steps works better than two separate models and performance tracks the quality of the reasons. A reader would care because the method turns general language models into effective career recommenders without needing large labeled datasets for every transition.

Core claim

Deriving high-quality oracle reasons via an LLM-as-a-Judge and fine-tuning LLMs on them for both reason generation and occupation prediction raises next occupation prediction accuracy to levels comparable with fully supervised methods and above unsupervised baselines; a single fine-tuned LLM outperforms two models trained separately on the two tasks; and prediction accuracy depends directly on the quality of the generated reasons.
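
The oracle-reason step can be pictured as best-of-N selection under a judge. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: `judge` is a hypothetical callable standing in for an LLM-as-a-Judge, and the 1-5 scale, the unweighted mean over the three criteria, and the `min_overall` threshold are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

CRITERIA = ("factuality", "coherence", "utility")  # criteria named in the paper


@dataclass
class ScoredReason:
    text: str
    scores: Dict[str, float]

    @property
    def overall(self) -> float:
        # Assumption: unweighted mean of the three criteria.
        return sum(self.scores[c] for c in CRITERIA) / len(CRITERIA)


def select_oracle_reason(
    candidates: List[str],
    judge: Callable[[str], Dict[str, float]],
    min_overall: float = 4.0,  # hypothetical threshold on an assumed 1-5 scale
) -> Optional[ScoredReason]:
    """Return the best-scoring candidate, or None if none clears the threshold."""
    scored = [ScoredReason(text, judge(text)) for text in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda r: r.overall)
    return best if best.overall >= min_overall else None


def toy_judge(reason: str) -> Dict[str, float]:
    """Stand-in for an LLM judge: longer reasons score higher (toy rule only)."""
    score = min(5.0, 1.0 + len(reason.split()) / 3)
    return {c: score for c in CRITERIA}


if __name__ == "__main__":
    candidates = [
        "Likes data.",
        "Moved from analyst to senior analyst roles, so prefers growing within analytics teams.",
    ]
    print(select_oracle_reason(candidates, toy_judge))
```

In practice `toy_judge` would be replaced by a call to a judge model that returns one score per criterion; returning `None` when no candidate passes lets the caller drop that user from the fine-tuning set rather than train on a weak reason.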

What carries the argument

A two-step pipeline in which a reason generator produces a preference summary from a user's education and career history that then serves as input to an occupation predictor, with both components improved by fine-tuning on oracle reasons.
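
As a rough sketch of that data flow, the two steps can be written as two pluggable callables. Everything named here is hypothetical: `generate_reason` and `predict_occupation` stand in for fine-tuned LLM calls, the toy rules exist only so the example runs, and passing the history to the predictor alongside the reason is an assumption (the abstract only says the reason is the predictor's input).

```python
from typing import Callable, Dict, List

History = List[str]  # e.g. ["BSc Statistics", "Data Analyst"]


def recommend_next_occupation(
    history: History,
    generate_reason: Callable[[History], str],
    predict_occupation: Callable[[History, str], str],
) -> Dict[str, str]:
    """Step 1: summarize preferences as a reason. Step 2: predict from the reason."""
    reason = generate_reason(history)
    occupation = predict_occupation(history, reason)
    return {"reason": reason, "next_occupation": occupation}


# Toy stand-ins so the sketch runs without a model.
def toy_reason_generator(history: History) -> str:
    return (
        f"The user has progressed through {', '.join(history)} "
        "and prefers to stay in that field."
    )


def toy_predictor(history: History, reason: str) -> str:
    last = history[-1]
    # Toy rule: a "stay in that field" reason maps to a more senior last role.
    if "stay in that field" in reason:
        return f"Senior {last}"
    return last


if __name__ == "__main__":
    result = recommend_next_occupation(
        ["BSc Statistics", "Data Analyst"], toy_reason_generator, toy_predictor
    )
    print(result["reason"])
    print(result["next_occupation"])
```

Both callables can wrap the same underlying model, which corresponds to the single-LLM setting, or two different models, which corresponds to the separately fine-tuned setting.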

If this is right

  • Next occupation prediction accuracy reaches levels comparable to fully supervised methods.
  • A single LLM fine-tuned on both tasks outperforms two separately fine-tuned models.
  • Prediction accuracy rises or falls with the measured quality of the generated reasons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reason-generation step could be tested in other sequential recommendation settings such as course or skill suggestions.
  • The method may lower the cost of building domain-specific recommenders by substituting LLM-generated reasons for human labels.
  • Explicit preference summaries could make model outputs more inspectable and allow users to correct the reason before the final prediction.

Load-bearing premise

That the high-quality oracle reasons generated by the LLM judge accurately reflect the unobserved user preferences that actually drive occupation choices.

What would settle it

Run the fine-tuned model on a test set of user histories but replace the generated reasons with random or low-scoring text and check whether the accuracy gain over the base LLM disappears.
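
A minimal harness for that check might look like the following. It is a sketch under assumptions: `predict` is a hypothetical wrapper around the fine-tuned model, the filler string stands in for "random or low-scoring text", and the two toy examples are invented so the code runs. None of the numbers come from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (history, generated_reason, true_next_occupation)
Example = Tuple[List[str], str, str]


def accuracy(
    examples: Sequence[Example],
    predict: Callable[[List[str], str], str],
    reason_override: Callable[[str], str] = lambda r: r,
) -> float:
    """Top-1 accuracy, optionally replacing each reason before prediction."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        predict(history, reason_override(reason)) == target
        for history, reason, target in examples
    )
    return hits / len(examples)


def reason_ablation(
    examples: Sequence[Example],
    predict: Callable[[List[str], str], str],
    filler: str = "lorem ipsum",  # assumed stand-in for uninformative text
) -> Dict[str, float]:
    """Compare accuracy with the generated reasons against accuracy with filler text."""
    with_reason = accuracy(examples, predict)
    ablated = accuracy(examples, predict, reason_override=lambda _r: filler)
    return {"with_reason": with_reason, "ablated": ablated, "gap": with_reason - ablated}


def toy_predict(history: List[str], reason: str) -> str:
    """Toy predictor whose output depends on the reason text."""
    return "Manager" if "lead" in reason else history[-1]


if __name__ == "__main__":
    toy_examples = [
        (["Engineer"], "wants to lead teams", "Manager"),
        (["Nurse"], "enjoys patient care", "Nurse"),
    ]
    print(reason_ablation(toy_examples, toy_predict))
```

A gap near zero on real data would suggest the predictor ignores the reason text; a large gap would suggest the reason carries information the predictor uses.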

Figures

Figures reproduced from arXiv: 2604.21204 by Ee-Peng Lim, Hieu Hien Mai, Lei Wang, Palakorn Achananuparp, Shan Dong, Yao Lu.

Figure 1: Reasoning-Augmented Occupation Prediction Framework.
Original abstract

In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user from their past education and career history. The reason summarizes the user's preferences and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial, as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs to improve their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence, and utility criteria, using an LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs' accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at https://github.com/Sarasarahhhhh/job_prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a two-step LLM-based framework for next-occupation recommendation. A reason generator first produces an 'oracle reason' summarizing a user's latent preferences from their education and career history; these reasons are scored by an LLM-as-a-Judge on factuality, coherence, and utility. The oracle reasons are then used to fine-tune smaller LLMs that jointly perform reason generation and occupation prediction. Experiments claim that the resulting models match fully supervised baselines, outperform unsupervised ones, that joint training beats separate models, and that prediction accuracy scales with reason quality. Public code is provided.

Significance. If the central empirical claims hold after validation, the work would demonstrate that synthetic reasoning traces can serve as an effective bridge between unsupervised pre-training and supervised task performance in a real-world sequential decision domain, reducing reliance on expensive human labels for career-path modeling. The joint-training result and the reported dependence of accuracy on reason quality are potentially reusable insights for other preference-inference tasks. The public code release strengthens reproducibility.

major comments (2)
  1. [Method (§3) and Experiments (§4)] The central claim that fine-tuning on LLM-as-a-Judge oracle reasons yields predictors 'comparable to fully supervised methods' (abstract and §4) rests on the untested premise that these synthetic reasons encode the unobserved user preferences driving career transitions. No human validation, inter-annotator agreement, or correlation between judge scores and downstream occupation accuracy is reported; without this, the accuracy gains could be explained by data augmentation or distillation rather than genuine reasoning.
  2. [Experiments (§4.3)] The single-LLM joint-training advantage (claim (b) in abstract) is presented as a key result, yet the paper does not report whether this advantage persists under distribution shift or when the judge LLM is replaced by a different model family. This is load-bearing for the practical recommendation to use a single jointly fine-tuned model.
minor comments (2)
  1. [Method (§3)] Notation for the reason generator and occupation predictor is introduced without a clear diagram or pseudocode; a figure showing the data flow from history → oracle reason → fine-tuning would improve clarity.
  2. [Experiments (§4)] The abstract states that 'the next occupation prediction accuracy depends on the quality of generated reasons' but does not quantify this dependence (e.g., via correlation or ablation tables); a dedicated table or plot would make the claim verifiable.
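
The kind of table requested in the second minor comment could be produced with a simple binning routine. The sketch below is an illustration only: it assumes each test example carries a numeric judge score for its generated reason and a flag for whether the prediction was correct, and both the `bin_width` default and the sample records are made up rather than taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_quality(
    records: Iterable[Tuple[float, bool]],
    bin_width: float = 1.0,
) -> Dict[float, float]:
    """Group (judge_score, prediction_correct) pairs into score bins
    and report prediction accuracy per bin, keyed by the bin's lower edge."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    totals: Dict[float, int] = defaultdict(int)
    hits: Dict[float, int] = defaultdict(int)
    for score, correct in records:
        bin_start = (score // bin_width) * bin_width
        totals[bin_start] += 1
        hits[bin_start] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


if __name__ == "__main__":
    # Invented records for illustration; not results from the paper.
    toy_records = [
        (1.2, False), (1.8, False),
        (3.1, True), (3.9, False),
        (4.5, True), (4.7, True),
    ]
    for bin_start, acc in accuracy_by_quality(toy_records).items():
        print(f"score >= {bin_start:.1f}: accuracy {acc:.2f}")
```

Accuracy that rises monotonically across bins would support the claimed dependence; a flat profile would not.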

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and outline planned revisions.

  1. Referee: [Method (§3) and Experiments (§4)] The central claim that fine-tuning on LLM-as-a-Judge oracle reasons yields predictors 'comparable to fully supervised methods' (abstract and §4) rests on the untested premise that these synthetic reasons encode the unobserved user preferences driving career transitions. No human validation, inter-annotator agreement, or correlation between judge scores and downstream occupation accuracy is reported; without this, the accuracy gains could be explained by data augmentation or distillation rather than genuine reasoning.

    Authors: We agree that human validation and inter-annotator agreement would strengthen claims about the reasons capturing latent preferences. The manuscript does not include such validation. However, we do report that accuracy scales with reason quality (claim (c) and §4.3), which provides evidence that performance gains track the assessed content of the reasons rather than arising solely from augmentation or distillation effects. We will revise to explicitly discuss the lack of human evaluation as a limitation and expand the analysis of quality-accuracy dependence to address alternative explanations. revision: partial

  2. Referee: [Experiments (§4.3)] The single-LLM joint-training advantage (claim (b) in abstract) is presented as a key result, yet the paper does not report whether this advantage persists under distribution shift or when the judge LLM is replaced by a different model family. This is load-bearing for the practical recommendation to use a single jointly fine-tuned model.

    Authors: Our experiments establish the joint-training advantage within the reported dataset and judge model. We did not evaluate robustness under distribution shift or alternative judge families. We will revise the manuscript to qualify the scope of this result and identify these checks as valuable future directions. revision: partial

Circularity Check

0 steps flagged

No circularity in empirical fine-tuning pipeline for occupation prediction


The paper's results derive from an empirical pipeline: LLM-as-a-Judge generates oracle reasons scored on factuality/coherence/utility from user histories, these are used as training signals to fine-tune models for joint reason generation and next-occupation prediction, and accuracy is measured against held-out real next-occupation labels. No equations or derivations reduce to self-defined inputs by construction, no fitted parameters are renamed as predictions, and no self-citations provide load-bearing uniqueness theorems or ansatzes. The evaluation targets are external data independent of the generated reasons, making the approach a standard supervised fine-tuning setup with auxiliary synthetic labels rather than a closed loop.
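
To make the "auxiliary synthetic labels" setup concrete, here is one way the fine-tuning records could be laid out for the single-model and two-model settings. This is a sketch under assumptions: the text available here does not spell out the record format, so the prompt templates, the field names `prompt` and `completion`, and the choice to have the single model emit the reason followed by the occupation are all hypothetical.

```python
from typing import Dict, List


def format_history(history: List[str]) -> str:
    """Render an education and career history as a single line."""
    return " -> ".join(history)


def joint_example(
    history: List[str], oracle_reason: str, next_occupation: str
) -> Dict[str, str]:
    """One record for one model: emit the reason, then the occupation."""
    return {
        "prompt": (
            f"History: {format_history(history)}\n"
            "Explain the user's preference, then name the next occupation."
        ),
        "completion": f"Reason: {oracle_reason}\nNext occupation: {next_occupation}",
    }


def separate_examples(
    history: List[str], oracle_reason: str, next_occupation: str
) -> Dict[str, Dict[str, str]]:
    """Two records for two models: history -> reason, and history + reason -> occupation."""
    h = format_history(history)
    return {
        "reason_generator": {
            "prompt": f"History: {h}\nExplain the user's preference.",
            "completion": oracle_reason,
        },
        "occupation_predictor": {
            "prompt": f"History: {h}\nReason: {oracle_reason}\nName the next occupation.",
            "completion": next_occupation,
        },
    }


if __name__ == "__main__":
    args = (["BSc", "Analyst"], "prefers analytics", "Senior Analyst")
    print(joint_example(*args))
    print(separate_examples(*args))
```

In both layouts the next-occupation label comes from the user's real history, while only the reason is synthetic, which is why evaluation against held-out labels stays independent of the generated reasons.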

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the two-step process and the quality of LLM-generated reasons, with no explicit free parameters but implicit reliance on the fine-tuning process and evaluation criteria.

axioms (1)
  • domain assumption: an LLM-as-a-Judge can reliably evaluate and generate high-quality reasons based on factuality, coherence, and utility.
    This is used to create training data for fine-tuning.
invented entities (1)
  • oracle reasons (no independent evidence)
    purpose: High-quality training data for fine-tuning reason generation and prediction.
    Generated by an LLM-as-a-Judge; the abstract mentions no external validation.


