Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning

Hongyu Yang; Qida Tan; Wenchao Du

arxiv: 2605.27080 · v1 · pith:QQBLYXECnew · submitted 2026-05-26 · 💻 cs.CV

Pith reviewed 2026-06-29 18:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords gaze estimation, semi-supervised learning, contrastive learning, disentangled representations, Jacobian regularization, domain generalization, appearance-based gaze

The pith

Jacobian regularization splits gaze features into pitch and yaw subspaces so that ordinal contrastive learning on unlabeled images reaches competitive accuracy with only 5 percent labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a semi-supervised framework that reduces the amount of manual gaze annotations needed by exploiting large amounts of unlabeled images. It applies Jacobian regularization to force the learned features to separate into two independent subspaces, one for pitch and one for yaw. Within each subspace the model then uses the natural ordering of the angles to create positive and negative pairs for contrastive learning. Experiments across several benchmarks show that this approach matches or exceeds prior methods when only 20 percent, 10 percent, or even 5 percent of the usual labeled data is available, and it does so in both in-domain and cross-domain settings. The method is presented as a plug-and-play module that can be added to existing gaze estimators.
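To make the low-label regimes concrete, here is a minimal sketch of how a labeled/unlabeled split at 20, 10, and 5 percent could be drawn. Uniform random sampling, the function name `split_labeled_unlabeled`, and the dataset size of 1000 are assumptions for illustration; the paper's own sampling procedure is not described in the material reviewed here.

```python
import random


def split_labeled_unlabeled(n_samples, labeled_fraction, seed=0):
    """Return (labeled_idx, unlabeled_idx) for a semi-supervised split.

    Hypothetical helper: samples the labeled subset uniformly at random.
    """
    if not 0.0 < labeled_fraction <= 1.0:
        raise ValueError("labeled_fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_labeled = max(1, round(n_samples * labeled_fraction))
    return sorted(indices[:n_labeled]), sorted(indices[n_labeled:])


for fraction in (0.20, 0.10, 0.05):
    labeled, unlabeled = split_labeled_unlabeled(1000, fraction, seed=0)
    print(f"{fraction:.0%}: {len(labeled)} labeled, {len(unlabeled)} unlabeled")
```

Fixing the seed matters here: comparisons between methods at a given label fraction are only meaningful if every method sees the same labeled subset.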

Core claim

By imposing Jacobian regularization the feature space factors into two subspaces each dedicated to a single gaze component; the intrinsic ordinal structure inside those subspaces then supplies a supervisory signal that lets contrastive learning operate effectively on unlabeled samples, producing robust gaze representations from far fewer annotations than full supervision requires.
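One way to picture the claimed factorization is a penalty on the cross terms of the Jacobian of the gaze outputs with respect to the features. The sketch below is an illustration under stated assumptions, not the paper's loss: the exact Jacobian term is not given in the material reviewed here, and the 4-dimensional feature, the block assignment `BLOCKS`, and the two toy heads are hypothetical. The Jacobian is estimated by central differences so the example needs no autodiff library.

```python
def numerical_jacobian(head, z, eps=1e-6):
    """Central-difference Jacobian: jac[i][j] = d head(z)[i] / d z[j]."""
    n_out = len(head(z))
    jac = [[0.0] * len(z) for _ in range(n_out)]
    for j in range(len(z)):
        z_plus = list(z)
        z_plus[j] += eps
        z_minus = list(z)
        z_minus[j] -= eps
        out_plus, out_minus = head(z_plus), head(z_minus)
        for i in range(n_out):
            jac[i][j] = (out_plus[i] - out_minus[i]) / (2 * eps)
    return jac


def cross_block_penalty(jac, blocks):
    """Sum of squared Jacobian entries outside each output's own block."""
    penalty = 0.0
    for i, row in enumerate(jac):
        own = set(blocks[i])
        penalty += sum(v * v for j, v in enumerate(row) if j not in own)
    return penalty


# Assumption: feature dims 0-1 belong to pitch, dims 2-3 belong to yaw.
BLOCKS = [(0, 1), (2, 3)]


def separated_head(z):
    # Pitch reads only dims 0-1, yaw reads only dims 2-3.
    return [0.5 * z[0] + 0.2 * z[1], 0.3 * z[2] - 0.4 * z[3]]


def mixed_head(z):
    # Pitch leaks into dim 3 and yaw leaks into dim 1.
    return [0.5 * z[0] + 0.2 * z[3], 0.3 * z[2] - 0.4 * z[1]]


z = [0.1, -0.2, 0.3, 0.4]
sep_penalty = cross_block_penalty(numerical_jacobian(separated_head, z), BLOCKS)
mix_penalty = cross_block_penalty(numerical_jacobian(mixed_head, z), BLOCKS)
print(sep_penalty, mix_penalty)
```

The separated head incurs essentially no penalty while the mixed head does, which is the behavior a disentangling regularizer would need to reward during training.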

What carries the argument

Disentangled Subspace Contrastive Learning (DSCL) that uses Jacobian regularization to create pitch-specific and yaw-specific subspaces whose ordinal rankings drive contrastive pairs on unlabeled data.

If this is right

The same architecture can be inserted into existing gaze networks without changing their backbone or loss structure.
Performance remains competitive under both in-domain and cross-domain evaluation when labeled data is reduced to 5 percent.
The ordinal contrastive term can be computed directly from the continuous angle values inside each disentangled subspace.
Domain generalization improves because the contrastive objective pulls together images that share the same angle ordering regardless of appearance shifts.
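The ordinal contrastive idea in the list above can be sketched with a generic rank-based contrastive loss. This is an assumed stand-in, not necessarily the loss DSCL uses: for an anchor `i` and a candidate `j`, every sample at least as far from `i` in angle as `j` is treated as a negative. The function name, the negative-distance similarity, and the temperature value are illustrative choices; on unlabeled data the `angles` would have to come from the model's own per-subspace predictions.

```python
import math


def ordinal_contrastive_loss(features, angles, temperature=0.5):
    """Rank-style contrastive loss over one subspace (pitch or yaw).

    features: one feature vector per sample from a single subspace.
    angles:   one scalar per sample that defines the ordering.
    """

    def similarity(a, b):
        # Negative Euclidean distance: closer features score higher.
        return -math.dist(a, b)

    n = len(features)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gap_ij = abs(angles[i] - angles[j])
            # Denominator: j itself plus all samples ranked farther from i.
            denom = sum(
                math.exp(similarity(features[i], features[k]) / temperature)
                for k in range(n)
                if k != i and abs(angles[i] - angles[k]) >= gap_ij
            )
            numer = math.exp(similarity(features[i], features[j]) / temperature)
            total += -math.log(numer / denom)
            count += 1
    return total / count


angles = [-10.0, 0.0, 10.0, 20.0]
ordered = [[0.0], [1.0], [2.0], [3.0]]   # feature order matches angle order
shuffled = [[2.0], [0.0], [3.0], [1.0]]  # feature order ignores angle order
ordered_loss = ordinal_contrastive_loss(ordered, angles)
shuffled_loss = ordinal_contrastive_loss(shuffled, angles)
print(ordered_loss < shuffled_loss)
```

Features whose layout follows the angle ordering receive a lower loss than features that scramble it, which is the supervisory signal the ordinal structure is supposed to provide.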

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the subspace separation generalizes, similar Jacobian-driven disentanglement could be tested on other continuous regression tasks such as head-pose or body orientation estimation.
The approach implies that explicit geometric regularization can substitute for some of the diversity that would otherwise have to come from additional labeled domains.
One could measure whether the learned subspaces remain orthogonal when the model is fine-tuned on entirely new camera setups or lighting conditions.
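The orthogonality check suggested above could start from something as coarse as the sketch below, which reports the largest absolute cosine between any direction assigned to pitch and any direction assigned to yaw. The function names and the toy basis vectors are hypothetical; in practice the directions might be rows of a regression head or of a Jacobian. A value of zero means the two sets span orthogonal subspaces, but nonzero values are only a rough indicator and are not principal angles.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_cross_cosine(basis_a, basis_b):
    """Largest |cos| between a direction in A and one in B (0 = orthogonal)."""
    return max(abs(cosine(u, v)) for u in basis_a for v in basis_b)


# Toy directions in a 4-D feature space (illustrative values only).
pitch_dirs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
yaw_dirs_before = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
yaw_dirs_after = [[0.6, 0.0, 0.8, 0.0], [0.0, 0.0, 0.0, 1.0]]  # drifted

before = max_cross_cosine(pitch_dirs, yaw_dirs_before)
after = max_cross_cosine(pitch_dirs, yaw_dirs_after)
print(before, after)
```

Tracking this number before and after fine-tuning on a new camera setup would show whether the separation survives the domain change.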

Load-bearing premise

Jacobian regularization will reliably isolate each gaze angle into its own subspace and the ordering inside those subspaces will supply a useful contrastive signal even when labels are scarce.

What would settle it

A controlled ablation in which the Jacobian term is removed, after which the subspaces mix pitch and yaw signals and accuracy with 5 percent labels falls back to the level of a standard supervised baseline on the same split.

Figures

Figures reproduced from arXiv: 2605.27080 by Hongyu Yang, Qida Tan, Wenchao Du.

**Figure 1.** Overview of the Disentangled Subspace Contrastive Learning (DSCL) Framework. DSCL first disentangles the gaze representation Z for specific gaze component (i.e., pitch ϕ and yaw ψ angles) regression with Jacobian regularization L_J on labeled samples, and then performs unsupervised contrastive learning on each disentangled subspace with unlabeled data.

**Figure 2.** Visualization for regularization Jacobian matrix. Darker colors indicate larger absolute values, while lighter colors represent smaller ones.

**Figure 5.** Visualization results of the predicted gaze for unlabeled data. The first row is from CLSS, and the bottom row is DSCL. The red and green arrows denote the ground truth and predicted gaze vector, respectively.

**Figure 4.** Visualization of the gaze feature distribution for unlabeled data. Different colors denote different gaze directions, and close gaze directions share similar colors.

**Figure 6.** Sampled gaze label distribution under different semi-supervised settings.

**Figure 7.** Visualized results for in-domain evaluations on Gaze360 testing set. The first row is from our baseline model, and the bottom row is from our DSCL method (i.e., Baseline + DSCL). The red and green arrows denote the ground truth and predicted gaze vectors, respectively. (a) PureGaze vs. PureGaze + DSCL (b) Baseline vs. Baseline + DSCL

**Figure 8.** Visualized results for domain-generalization (DG) evaluations on MPIIGaze and EyeDiap datasets. The first row is from the original model, and the bottom row is from our DSCL method (i.e., PureGaze/Baseline + DSCL). The red and green arrows denote the ground truth and predicted gaze vectors, respectively.

**Figure 9.** Visualized results for domain-adaptation (DA) evaluations on MPIIGaze and EyeDiap datasets under 10% semi-supervised setting. The first row is from the original model, and the bottom row is from our DSCL method (i.e., PnP-GA/UnReGa + DSCL). The red and green arrows denote the ground truth and predicted gaze vectors, respectively.

**Figure 10.** Visualized results for supervised domain-adaptation (SDA) evaluations on MPIIGaze and EyeDiap datasets under 10% semi-supervised setting. The first row is from our baseline model, and the bottom row is from our DSCL method (i.e., Baseline + DSCL). The red and green arrows denote the ground truth and predicted gaze vectors, respectively.

Original abstract

Appearance-based gaze estimation always suffers from poor generalization due to limited annotated samples and insufficient dataset diversity. Leading approaches adopt weakly supervised learning to generate large-scale pseudo-labeled data from unconstrained real-world scenarios, aiming to mitigate the domain shifts. In this work, we devise a simple yet effective semi-supervised learning architecture that leverages unlabeled data to enhance domain generalization, thereby reducing reliance on labor-intensive manual annotations. Our key insight is to impose Jacobian regularization to disentangle feature representations into discriminative subspaces dedicated to specific gaze components, such as pitch and yaw angles. We further exploit the intrinsic ordinal ranking within each subspace for contrastive learning, enabling the model to learn robust gaze representations from a small set of labeled samples and an abundance of unlabeled ones. This ultimately yields our Disentangled Subspace Contrastive Learning (DSCL) framework. Extensive experiments on multiple benchmarks verify that the proposed DSCL is plug-and-play, achieving competitive performance using only 20%, 10%, and even 5% of the annotated data under both in-domain and cross-domain evaluation settings. The public code is available at https://github.com/da60266/DSCL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core claim rests on an unshown assumption that Jacobian regularization cleanly separates pitch/yaw subspaces whose internal order then gives a useful contrastive signal on unlabeled data.

The headline takeaway is that DSCL pairs Jacobian regularization for subspace disentanglement with ordinal contrastive learning on the resulting subspaces, and the authors position this as a plug-and-play way to cut labeled data to 5-20% while staying competitive on gaze benchmarks in both in-domain and cross-domain settings.

What is actually new is the specific combination for this regression task; prior gaze work has used contrastive losses and some regularization, but the Jacobian step to force dedicated pitch/yaw subspaces followed by ordinal ranking inside them does not appear in the cited literature. The paper also ships public code, which is useful.

The practical angle is reasonable: gaze estimation is annotation-heavy and domain-shift prone, so any method that reliably works with far fewer labels would matter for HCI and driver monitoring. The abstract states competitive numbers at low label fractions, which is the kind of result people would want to test.

The soft spot is exactly the one the stress-test flags. In its common form, Jacobian regularization is a smoothness penalty; nothing in the high-level description shows that it isolates one gaze component from the other rather than just damping sensitivity overall. If the subspaces stay mixed or the ordinal signal is noisy, the contrastive term adds little and the reported gains cannot be credited to the claimed mechanism. The referee report notes no tables, ablations, or error bars in the supplied material, so the central empirical claim stays unverified. That makes the load-bearing assumption hard to assess.
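A small numerical illustration of that worry, under assumed toy values: two hypothetical 2×4 Jacobians (rows for pitch and yaw, columns for feature dimensions) have the same squared Frobenius norm up to rounding, yet one is cleanly block-structured and the other mixes the components. A plain norm penalty cannot tell them apart; only a term that looks at where the mass sits can. The block assignment and the matrices are illustrative, not taken from the paper.

```python
def frobenius_sq(jac):
    """Squared Frobenius norm: the quantity a plain smoothness penalty sees."""
    return sum(v * v for row in jac for v in row)


def cross_block_fraction(jac, blocks):
    """Share of Jacobian energy that falls outside each output's own block."""
    cross = sum(
        v * v
        for i, row in enumerate(jac)
        for j, v in enumerate(row)
        if j not in blocks[i]
    )
    return cross / frobenius_sq(jac)


# Assumption: dims 0-1 are the pitch block, dims 2-3 are the yaw block.
blocks = [{0, 1}, {2, 3}]
separated = [[0.5, 0.2, 0.0, 0.0], [0.0, 0.0, 0.3, 0.4]]
mixed = [[0.5, 0.0, 0.0, 0.2], [0.0, 0.4, 0.3, 0.0]]

print(frobenius_sq(separated), frobenius_sq(mixed))
print(cross_block_fraction(separated, blocks), cross_block_fraction(mixed, blocks))
```

Reporting a cross-block fraction of this kind alongside accuracy would show whether the regularizer separates the components or merely shrinks the Jacobian.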

This is for people working on semi-supervised regression in computer vision, especially those already looking at gaze or head-pose tasks. It deserves a serious referee because the problem is real, the code is public, and the idea is concrete enough to check; an editor should send it out but flag the disentanglement claim for close scrutiny in review.

Referee Report

2 major / 1 minor

Summary. The paper proposes Disentangled Subspace Contrastive Learning (DSCL), a semi-supervised framework for appearance-based gaze estimation. Jacobian regularization is applied to disentangle feature representations into subspaces dedicated to individual gaze components (e.g., pitch and yaw); intrinsic ordinal ranking within each subspace then supplies a contrastive signal on unlabeled samples. The method is presented as plug-and-play and is claimed to deliver competitive performance on multiple benchmarks using only 5–20% labeled data under both in-domain and cross-domain protocols, with public code released.

Significance. If the disentanglement mechanism and resulting gains are validated, the approach could meaningfully lower annotation costs for gaze estimation while improving cross-domain robustness. The public code release is a concrete strength that supports reproducibility and follow-up work.

major comments (2)

[§3] Method description: The central claim that Jacobian regularization isolates subspaces each dedicated to a single gaze component (pitch or yaw) and that the resulting ordinal ranking supplies a reliable contrastive signal is not supported by any direct evidence. No visualizations of the subspaces, quantitative disentanglement metrics (e.g., mutual information or correlation between pitch/yaw directions), or ablation isolating the regularization’s effect on subspace purity are provided. Because every downstream claim (low-label performance, cross-domain gains) rests on this assumption, the attribution of results to the proposed mechanism remains unverified.
[Experiments] Experimental section / abstract: The manuscript asserts competitive results at 5%, 10%, and 20% label fractions on multiple benchmarks, yet the provided description contains no quantitative tables, ablation studies on the Jacobian term, error-bar statistics, or statistical significance tests. Without these, the empirical support for the central performance claim cannot be assessed and the “plug-and-play” assertion cannot be evaluated.
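The correlation-style disentanglement metric asked for in the first major comment could be as simple as the following sketch: the absolute Pearson correlation between a scalar score from each subspace and each gaze angle, computed on held-out labeled data. The function names, the toy angles, and the three example scores are all hypothetical; a real check would use the model's own subspace projections.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def purity_table(subspace_scores, labels):
    """|Pearson r| of every subspace score against every gaze component."""
    return {
        (s_name, l_name): abs(pearson(s_vals, l_vals))
        for s_name, s_vals in subspace_scores.items()
        for l_name, l_vals in labels.items()
    }


# Toy held-out labels in degrees; pitch and yaw are uncorrelated here.
labels = {
    "pitch": [-10.0, -10.0, 10.0, 10.0],
    "yaw": [-20.0, 20.0, -20.0, 20.0],
}
scores = {
    "pitch_subspace": [-1.0, -1.0, 1.0, 1.0],  # tracks pitch only
    "yaw_subspace": [-2.0, 2.0, -2.0, 2.0],    # tracks yaw only
    "mixed": [-30.0, 10.0, -10.0, 30.0],       # pitch + yaw entangled
}

table = purity_table(scores, labels)
for key in sorted(table):
    print(key, round(table[key], 3))
```

A disentangled model should show values near one on the matching pairs and near zero on the cross pairs; an entangled score, like the `mixed` row, correlates with both angles.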

minor comments (1)

The abstract would benefit from naming the specific benchmarks and evaluation protocols used to support the reported numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the supporting evidence for the proposed mechanism and empirical claims.

Referee: [§3] Method description: The central claim that Jacobian regularization isolates subspaces each dedicated to a single gaze component (pitch or yaw) and that the resulting ordinal ranking supplies a reliable contrastive signal is not supported by any direct evidence. No visualizations of the subspaces, quantitative disentanglement metrics (e.g., mutual information or correlation between pitch/yaw directions), or ablation isolating the regularization’s effect on subspace purity are provided. Because every downstream claim (low-label performance, cross-domain gains) rests on this assumption, the attribution of results to the proposed mechanism remains unverified.

Authors: We agree that direct evidence for the disentanglement effect would strengthen the paper. In the revision we will add (i) visualizations of the feature subspaces, (ii) quantitative metrics including correlation and mutual-information scores between each subspace and the corresponding gaze component, and (iii) an ablation that isolates the Jacobian term’s contribution to subspace purity and downstream performance.

Revision: yes

Referee: [Experiments] Experimental section / abstract: The manuscript asserts competitive results at 5%, 10%, and 20% label fractions on multiple benchmarks, yet the provided description contains no quantitative tables, ablation studies on the Jacobian term, error-bar statistics, or statistical significance tests. Without these, the empirical support for the central performance claim cannot be assessed and the “plug-and-play” assertion cannot be evaluated.

Authors: We will expand the experimental section with (i) full quantitative tables for the 5%, 10%, and 20% label regimes on all reported benchmarks, (ii) a dedicated ablation on the Jacobian regularization term, (iii) error bars computed over multiple random seeds, and (iv) statistical significance tests against the baselines. These additions will allow direct evaluation of the performance claims and the plug-and-play property.

Revision: yes
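The error bars and significance test promised in the second response could be computed along these lines. This is a sketch under stated assumptions: the per-seed angular errors below are placeholder numbers invented for illustration, not results from the paper, and the exact sign-flip permutation test is one reasonable choice among several for paired per-seed comparisons.

```python
import itertools
import statistics


def mean_and_std(values):
    """Mean and sample standard deviation across random seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_sign_flip_p_value(baseline, method):
    """Exact two-sided sign-flip permutation test on paired per-seed errors."""
    diffs = [b - m for b, m in zip(baseline, method)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Under the null, each paired difference is equally likely to flip sign.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total


# Placeholder angular errors in degrees for five seeds (NOT paper results).
baseline_err = [12.0, 12.4, 11.8, 12.2, 12.1]
method_err = [11.5, 11.9, 11.6, 11.7, 11.8]

base_mean, base_std = mean_and_std(baseline_err)
meth_mean, meth_std = mean_and_std(method_err)
p_value = paired_sign_flip_p_value(baseline_err, method_err)
print(f"baseline {base_mean:.2f} +/- {base_std:.2f}")
print(f"method   {meth_mean:.2f} +/- {meth_std:.2f}")
print(f"p = {p_value:.4f}")
```

With only five seeds the smallest two-sided p-value this test can return is 2/32, so a revision aiming for conventional significance thresholds would need more seeds or a per-sample test.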

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained and independent of evaluation results.

The paper presents a semi-supervised architecture that applies Jacobian regularization to encourage subspace disentanglement followed by contrastive learning on intrinsic ordinal rankings. No quoted equations, definitions, or claims reduce any output quantity to an input by construction, nor do any 'predictions' collapse to fitted parameters on the same data. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The method description stands independently of the reported benchmark numbers, which are obtained from external evaluation sets rather than being statistically forced by the training procedure itself. This is the normal case of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method description implies standard loss-weighting hyperparameters whose values are not stated.

Reference graph

Works this paper leans on

8 extracted references · 2 canonical work pages · 1 internal anchor

[1] Cheng, Y. and Lu, F. Gaze estimation using transformer. In 2022 26th International Conference on Pattern Recognition (ICPR), pp. 3341–3347. IEEE, 2022.

[2] Steil, J., Huang, M. X., and Bulling, A. Fixation detection for head-mounted eye tracking based on visual similarity of gaze targets. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, pp. 1–9, 2018.

[3] Yi, D., Lei, Z., Liao, S., and Li, S. Z. Learning face representation from scratch. arXiv preprint arXiv:1411.7923.

[4] was collected in both indoor and outdoor environments, comprising labeled 3D gaze data from 238 subjects with a diverse range of head poses and gaze directions. Following the preprocessing steps in (Cheng et al., 2024), we exclude images where the subject’s face is not visible. The remaining data is divided into a training set of 84,902 images, which is f...

[5] And 16,031 images are served as the test set for in-domain evaluations. MPIIGaze (Zhang et al., 2017a) was collected from 15 subjects in unconstrained real-world environments. Adhering to the standard evaluation protocol, we select a subset of 3,000 images from each subject. For cross-domain evaluations, consistent with previous works (Cheng et al., 2022; ...

[6] consists of video clips recorded from 16 subjects, where gaze targets are defined by either screen targets or 3D floating balls. Adhering to the protocol in (Cheng et al., 2024), we select 16,674 images from 14 subjects to serve as the target domain for cross-domain evaluations. C. Extension to Other Multi-Target Regression Tasks Beyond the gaze estimatio...

[7] We utilize Adam optimizer with learning rate 2e−5 during the training. The hyperparameters γ, wSC, wUC and wUR are set to 1.0, 1.0, 0.05, 0.01. We randomly select samples for Dsprites dataset, and ...

[8] and PnP-GA (Liu et al., 2021), and then inject our DSCL into them, results are shown in Figure
