Cognitive-Causal Multi-Task Learning with Psychological State Conditioning for Assistive Driving Perception

Akira Imanishi; Keito Inoshita; Nobuhiro Hayashida

arxiv: 2604.07651 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-10 17:28 UTC · model grok-4.3

keywords multi-task learning, causal multi-task learning, driver assistance systems, psychological conditioning, cognitive causal structure, assistive driving perception, emotion recognition, behavior recognition

The pith

CauPsi models cognitive task dependencies in driving perception by chaining context recognition to emotion and behavior via prototype embeddings and conditioning all tasks on estimated driver psychological states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CauPsi to address flat, independent treatment of recognition tasks in driver assistance systems by explicitly modeling the cognitive causal structure among traffic context recognition, vehicle context recognition, driver emotion recognition, and driver behavior recognition. It uses a Causal Task Chain to propagate upstream predictions downstream through learnable prototype embeddings in a differentiable way. It adds Cross-Task Psychological Conditioning that derives a psychological state signal from facial expressions and body posture and applies it to condition every task, including environmental ones. On the AIDE dataset this yields 82.71 percent mean accuracy using 5.05 million parameters, a 1.0 percent gain over prior work with larger gains on emotion and behavior tasks. Ablations confirm each mechanism adds value independently while the state signal learns task-dependent patterns without any explicit psychological labels.

Core claim

We propose CauPsi, a cognitive science-grounded causal multi-task learning framework that explicitly models the hierarchical dependencies among TCR, VCR, DER, and DBR. The framework introduces a Causal Task Chain that propagates upstream task predictions to downstream tasks via learnable prototype embeddings, realizing the cognitive cascade from environmental perception to behavioral regulation in a differentiable manner, and Cross-Task Psychological Conditioning that estimates a psychological state signal from driver facial expressions and body posture and injects it as a conditioning input to all tasks.
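As a rough illustration of how an upstream prediction can be passed downstream through prototype embeddings while staying differentiable, the sketch below weights one prototype vector per class by the upstream softmax probabilities and appends the result to the downstream features. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the concatenation step, the toy dimensions, and the TCR-to-DER pairing are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_prototype(logits, prototypes):
    """Probability-weighted sum of per-class prototype vectors.

    prototypes[k] stands in for a learnable embedding of class k (assumption).
    Using probabilities instead of a hard argmax keeps the step differentiable.
    """
    probs = softmax(logits)
    dim = len(prototypes[0])
    return [sum(p * proto[d] for p, proto in zip(probs, prototypes))
            for d in range(dim)]

def downstream_input(features, upstream_messages):
    """Concatenate a task's own features with messages from upstream tasks."""
    out = list(features)
    for msg in upstream_messages:
        out.extend(msg)
    return out

# Toy example: a 3-class upstream task with 2-dimensional prototypes.
tcr_logits = [2.0, 0.0, 0.0]
tcr_prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tcr_message = soft_prototype(tcr_logits, tcr_prototypes)
der_input = downstream_input([0.5, 0.5, 0.5], [tcr_message])
```

With uniform upstream logits the message reduces to the plain average of the prototypes, which is one way to see that the downstream task receives less specific information when the upstream task is uncertain.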

What carries the argument

Causal Task Chain that propagates upstream predictions via learnable prototype embeddings, together with Cross-Task Psychological Conditioning that injects a self-supervised psychological state signal estimated from facial expressions and body posture.
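The conditioning half of the mechanism can be pictured as follows: a state vector is estimated from face and posture features, and the same vector is handed to every task, including the environmental ones. The sketch assumes a single linear layer with tanh as the estimator and plain concatenation as the injection step; both are illustrative choices, and all names, weights, and dimensions are hypothetical (the toy state has 2 dimensions, not the 16 mentioned in the Figure 3 caption).

```python
import math

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def estimate_psi(face_feat, body_feat, weights, bias):
    """Hypothetical state estimator over the concatenation [face; body]."""
    return [math.tanh(v) for v in linear(weights, bias, face_feat + body_feat)]

def condition_all_tasks(task_features, psi):
    """Append the same state vector to the features of every task."""
    return {task: feats + psi for task, feats in task_features.items()}

# Toy inputs (made up for illustration only).
face = [1.0, 0.0]
body = [0.0, 1.0]
weights = [[0.5, -0.5, 0.25, 0.0],
           [0.0, 0.1, 0.0, -0.2]]
bias = [0.0, 0.1]
psi = estimate_psi(face, body, weights, bias)

task_features = {"TCR": [0.1, 0.2], "VCR": [0.3], "DER": [0.4, 0.5], "DBR": [0.6]}
conditioned = condition_all_tasks(task_features, psi)
```

Because no label for the state itself enters this computation, any structure the vector acquires has to come from the task losses that flow back through it.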

If this is right

Mean accuracy reaches 82.71 percent and exceeds prior work by 1.0 percent overall while using only 5.05 million parameters.
Accuracy on driver emotion recognition rises by 3.65 percent and on driver behavior recognition by 7.53 percent.
Ablation studies show that each of the two mechanisms contributes independently to the gains.
The psychological state signal develops systematic task-label-dependent patterns through self-supervision alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the causal chaining holds, the same prototype-propagation approach could be applied to other sequential multi-task settings where perception precedes action, such as robotic manipulation or medical image workflows.
The self-supervised conditioning mechanism suggests a route to improve perception models in any domain where internal human state modulates external recognition without requiring new labeled data.
One could test the claimed cognitive cascade by intervening on upstream task outputs at inference time and measuring whether downstream accuracy changes as predicted by the chain.
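The intervention test in the last point could be set up roughly as below: force the upstream output to a fixed class for every sample and compare downstream accuracy with and without the override. This is only a sketch of the evaluation loop. `toy_predictor` is a stand-in for a trained downstream head, and the samples are invented for illustration.

```python
def one_hot(index, size):
    """Vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def accuracy(predict_downstream, samples, upstream_override=None):
    """samples: list of (features, upstream_probs, downstream_label)."""
    correct = 0
    for features, upstream_probs, label in samples:
        probs = upstream_override if upstream_override is not None else upstream_probs
        if predict_downstream(features, probs) == label:
            correct += 1
    return correct / len(samples)

def intervention_effect(predict_downstream, samples, forced_class, n_upstream):
    """Return (baseline accuracy, accuracy under intervention, difference)."""
    base = accuracy(predict_downstream, samples)
    forced = accuracy(predict_downstream, samples,
                      one_hot(forced_class, n_upstream))
    return base, forced, forced - base

def toy_predictor(features, upstream_probs):
    """Hypothetical downstream head that depends on upstream class 1."""
    return 1 if features[0] + upstream_probs[1] > 1.0 else 0

toy_samples = [
    ([0.6], [0.2, 0.8], 1),
    ([0.6], [0.9, 0.1], 0),
    ([0.1], [0.3, 0.7], 0),
    ([0.9], [0.5, 0.5], 1),
]
base, forced, delta = intervention_effect(toy_predictor, toy_samples, 0, 2)
```

A downstream head that ignored the upstream input would show a difference of zero under every forced class, which is the outcome that would count against the claimed chain.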

Load-bearing premise

The assumption that hierarchical cognitive dependencies among the four tasks can be realized by learnable prototype embeddings and that a psychological state estimated from expressions and posture meaningfully conditions environmental recognition without explicit annotations.

What would settle it

An ablation on the AIDE dataset in which removing either the Causal Task Chain or the psychological conditioning produces no accuracy drop on downstream tasks or on environmental recognition, or in which the learned state signal shows no systematic correlation with task labels.
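The second half of that criterion, whether the state signal varies with task labels, can be checked with a class-stratified mean of the kind summarized in the Figure 3 caption. The sketch below computes per-class means of the state vector and a crude spread measure. The spread measure is an assumption added for illustration, not a statistic reported in the paper, and a real analysis would also need a significance test.

```python
from collections import defaultdict

def class_stratified_means(psi_vectors, labels):
    """Mean of each state dimension, computed separately per class label."""
    sums = {}
    counts = defaultdict(int)
    for psi, label in zip(psi_vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(psi)
        for d, v in enumerate(psi):
            sums[label][d] += v
        counts[label] += 1
    return {label: [s / counts[label] for s in vec]
            for label, vec in sums.items()}

def max_between_class_gap(means):
    """Largest per-dimension spread of class means (hypothetical summary).

    A value near zero would suggest the signal does not depend on the label.
    """
    labels = list(means)
    dim = len(means[labels[0]])
    return max(
        max(means[l][d] for l in labels) - min(means[l][d] for l in labels)
        for d in range(dim)
    )

# Toy data: three samples, two classes, 2-dimensional state vectors.
toy_psi = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
toy_labels = [0, 0, 1]
means = class_stratified_means(toy_psi, toy_labels)
gap = max_between_class_gap(means)
```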

Figures

Figures reproduced from arXiv: 2604.07651 by Akira Imanishi, Keito Inoshita, Nobuhiro Hayashida.

**Figure 1.** Overall architecture of CauPsi. … driver facial expressions and body posture and injects it as a conditioning input to all tasks; iii) a Causal Task Chain that explicitly models inter-task causal dependencies via prototype embeddings; and iv) loss functions and training stabilization techniques. The overall processing flow is as follows. Each view's video is processed by a frozen pre-trained encoder and temp…

**Figure 2.** Detailed architecture of the Causal Task Chain.

**Figure 3.** Mean values across the 16 dimensions of ψ, stratified by class for each task. … predicts solely from its task-specific projection zr and scene features, unable to exploit upstream cognitive outputs. Removing CTPC (−0.94%) degrades VCR (−1.65%) and DBR (−2.95%), corroborating that ψ contributes to both environmental and behavioral recognition. Cross-View Attention contributes modestly overall (−0.33%) but sho…

**Figure 4.** Normalized confusion matrices for all four tasks.

Original abstract

Multi-task learning for advanced driver assistance systems requires modeling the complex interplay between driver internal states and external traffic environments. However, existing methods treat recognition tasks as flat and independent objectives, failing to exploit the cognitive causal structure underlying driving behavior. In this paper, we propose CauPsi, a cognitive science-grounded causal multi-task learning framework that explicitly models the hierarchical dependencies among Traffic Context Recognition (TCR), Vehicle Context Recognition (VCR), Driver Emotion Recognition (DER), and Driver Behavior Recognition (DBR). The proposed framework introduces two key mechanisms. First, a Causal Task Chain propagates upstream task predictions to downstream tasks via learnable prototype embeddings, realizing the cognitive cascade from environmental perception to behavioral regulation in a differentiable manner. Second, Cross-Task Psychological Conditioning (CTPC) estimates a psychological state signal from driver facial expressions and body posture and injects it as a conditioning input to all tasks including environmental recognition, thereby modeling the modulatory effect of driver internal states on cognitive and decision-making processes. Evaluated on the AIDE dataset, CauPsi achieves a mean accuracy of 82.71% with only 5.05M parameters, surpassing prior work by +1.0% overall, with notable improvements on DER (+3.65%) and DBR (+7.53%). Ablation studies validate the independent contribution of each component, and analysis of the psychological state signal confirms that it acquires systematic task-label-dependent patterns in a self-supervised manner without explicit psychological annotations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CauPsi chains driving tasks causally with prototype embeddings and adds self-supervised psychological conditioning, but the accuracy gains stay small and rest on thin evidence from one dataset.

The paper's core move is to treat TCR, VCR, DER, and DBR as a cognitive cascade rather than flat tasks. It uses learnable prototypes to pass upstream predictions downstream in a differentiable way, then injects a psychological state signal (pulled from faces and posture without explicit labels) into every task via CTPC. On the AIDE dataset this yields 82.71% mean accuracy with a 5M-parameter model, a 1% lift overall and larger jumps on the driver-side tasks. Ablations are reported to show each piece contributes separately, which is useful to see. The small model size and the attempt to make the hierarchy explicit are the parts that stand out as concrete engineering choices.

The main weaknesses are the lack of error bars, significance tests, or split details, so the reported deltas could easily be within noise. The self-supervised psychological signal is derived from the same task labels it is meant to modulate, which creates a real risk that it is just capturing dataset correlations instead of genuine modulatory effects. The assumed task hierarchy is also not compared against standard multi-task baselines that ignore the cognitive story.

This is aimed at researchers working on multi-task perception for driver assistance or other safety-critical human-AI settings. Someone looking for a structured way to fold internal-state conditioning into recognition pipelines could extract usable ideas, but the current results are too preliminary to treat as settled. Send it to peer review so the authors can add statistical controls, more datasets, and direct comparisons to non-causal multi-task models.

Referee Report

3 major / 2 minor

Summary. The paper proposes CauPsi, a cognitive-causal multi-task learning framework for assistive driving perception tasks. It models hierarchical dependencies among Traffic Context Recognition (TCR), Vehicle Context Recognition (VCR), Driver Emotion Recognition (DER), and Driver Behavior Recognition (DBR) via a Causal Task Chain that propagates upstream predictions to downstream tasks using learnable prototype embeddings. It further introduces Cross-Task Psychological Conditioning (CTPC) to estimate a psychological state signal from facial expressions and body posture (self-supervised, without explicit annotations) and inject it to condition all tasks. On the AIDE dataset, CauPsi reports 82.71% mean accuracy using 5.05M parameters, outperforming prior work by +1.0% overall with gains of +3.65% on DER and +7.53% on DBR; ablations are claimed to validate each component's contribution, and analysis shows the psychological signal acquires task-label-dependent patterns.

Significance. If the central claims hold under rigorous validation, the work would be significant for multi-task learning in ADAS by explicitly incorporating cognitive science principles of causal hierarchies and internal state modulation, potentially improving robustness in real-world driving scenarios. The low parameter count (5.05M) is a practical strength for deployment. The self-supervised CTPC without psychological annotations and the differentiable Causal Task Chain represent innovative bridges between cognitive models and ML; the reported ablations provide some evidence for component independence, which strengthens the case if statistical controls are added.

major comments (3)

§4 (Experiments), Table 1: The central performance claim of 82.71% mean accuracy and +1.0% improvement (with specific +3.65% DER, +7.53% DBR gains) is reported without error bars, standard deviations across runs, dataset split details, or statistical significance tests; this is load-bearing as it prevents assessment of whether gains are reliable or due to variance.
§3.2 (CTPC), Eq. (5) and surrounding text: The psychological state is learned self-supervised from task labels and injected to condition upstream environmental tasks (TCR/VCR); this creates a circularity risk where the signal may fit to downstream label correlations in AIDE rather than providing independent causal modulation, as the ablations do not include controls for information leakage or alternative conditioning baselines.
§3.1 (Causal Task Chain): The claim that learnable prototype embeddings realize hierarchical cognitive dependencies among TCR/VCR/DER/DBR lacks explicit controls for information leakage from downstream labels or comparisons to standard multi-task baselines (e.g., shared backbone without prototypes); this is central to validating the 'cognitive-causal' structure over flat multi-task learning.
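The statistical reporting asked for in the first major comment could look roughly like this: summarize accuracy over several seeded runs as mean and sample standard deviation, then compute a paired t statistic against a baseline trained on the same seeds. The accuracy lists are placeholder values invented for this sketch; they are not results from the paper.

```python
import math
import statistics

def summarize(accuracies):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched runs a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    if sd_d == 0:
        raise ValueError("all paired differences are identical")
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Placeholder accuracies (percent) for five seeds; NOT taken from the paper.
model_runs = [82.0, 83.0, 81.0, 84.0, 82.0]
baseline_runs = [81.0, 82.0, 81.0, 82.0, 81.0]

model_mean, model_std = summarize(model_runs)
t_stat, dof = paired_t_statistic(model_runs, baseline_runs)
```

The p-value then follows from the t distribution with `dof` degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_rel`) performs the whole test in one call.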

minor comments (2)

[Abstract] The abstract and §1 could more clearly distinguish the self-supervised psychological signal from potential dataset biases in AIDE collection.
[Figure 2] Figure 2 (framework diagram) would benefit from explicit arrows or labels showing how prototype embeddings propagate predictions and how CTPC conditions environmental tasks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor and methodological validation that we will address in the revision. Below we respond point-by-point to the major comments.

Referee: §4 (Experiments), Table 1: The central performance claim of 82.71% mean accuracy and +1.0% improvement (with specific +3.65% DER, +7.53% DBR gains) is reported without error bars, standard deviations across runs, dataset split details, or statistical significance tests; this is load-bearing as it prevents assessment of whether gains are reliable or due to variance.

Authors: We agree that reporting variability and statistical significance is essential for assessing the reliability of the reported gains. Section 4.1 already describes the AIDE dataset splits (70/15/15 train/val/test), but we will expand this with explicit details on random seeds and preprocessing. In the revised manuscript we will report mean accuracy and standard deviation over five independent runs with different random seeds, and include paired t-tests (with p-values) comparing CauPsi against each baseline to establish statistical significance of the +1.0% overall, +3.65% DER, and +7.53% DBR improvements. revision: yes

Referee: §3.2 (CTPC), Eq. (5) and surrounding text: The psychological state is learned self-supervised from task labels and injected to condition upstream environmental tasks (TCR/VCR); this creates a circularity risk where the signal may fit to downstream label correlations in AIDE rather than providing independent causal modulation, as the ablations do not include controls for information leakage or alternative conditioning baselines.

Authors: We appreciate the referee’s concern regarding potential circularity. The CTPC module extracts the psychological state exclusively from driver facial and postural features; task labels influence the state only indirectly through the joint loss, not by direct access to downstream labels during conditioning. Nevertheless, to rule out leakage we will add three new controls in the revision: (1) an ablation replacing the learned psychological vector with random Gaussian noise of identical dimension, (2) a version that conditions only downstream tasks while leaving TCR/VCR unconditioned, and (3) a comparison against a standard FiLM conditioning baseline that does not interpret the signal as psychological. These experiments will be reported alongside the existing ablations. revision: yes

Referee: §3.1 (Causal Task Chain): The claim that learnable prototype embeddings realize hierarchical cognitive dependencies among TCR/VCR/DER/DBR lacks explicit controls for information leakage from downstream labels or comparisons to standard multi-task baselines (e.g., shared backbone without prototypes); this is central to validating the 'cognitive-causal' structure over flat multi-task learning.

Authors: We acknowledge that the current ablation (Table 2, row “w/o Causal Task Chain”) removes prototype propagation but does not fully isolate leakage or compare against a pure shared-backbone multi-task model. In the revision we will add: (i) a flat multi-task baseline that shares the backbone and heads but omits the prototype chain entirely, (ii) an experiment in which downstream labels are masked when computing prototype embeddings (so only upstream predictions are used), and (iii) t-SNE visualizations of the learned prototypes to illustrate that they encode the intended hierarchical relations. These additions will provide stronger evidence that the performance gains stem from the cognitive-causal structure rather than label leakage. revision: yes
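Two of the controls promised in the second response, the random-noise replacement and the FiLM baseline, are simple enough to sketch. FiLM (feature-wise linear modulation) scales and shifts each feature by values computed from a conditioning signal; the noise control swaps that signal for a Gaussian vector of the same dimension. The weight matrices and the `1.0 +` offset that makes a zero signal act as the identity are assumptions of this sketch, not details from the paper or the rebuttal.

```python
import random

def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * x + beta, element by element."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

def film_params(signal, w_gamma, w_beta):
    """Hypothetical linear maps from the conditioning signal to (gamma, beta)."""
    gamma = [1.0 + sum(w * s for w, s in zip(row, signal)) for row in w_gamma]
    beta = [sum(w * s for w, s in zip(row, signal)) for row in w_beta]
    return gamma, beta

def gaussian_like(signal, rng):
    """Control input: standard Gaussian noise with the signal's dimension."""
    return [rng.gauss(0.0, 1.0) for _ in signal]

# Toy example with made-up weights.
features = [1.0, 2.0]
psi = [0.5, -0.5]
w_gamma = [[1.0, 0.0], [0.0, 1.0]]
w_beta = [[0.0, 0.0], [1.0, 1.0]]

gamma, beta = film_params(psi, w_gamma, w_beta)
modulated = film(features, gamma, beta)

# Noise control: same pipeline, but the state vector is replaced by noise.
noise = gaussian_like(psi, random.Random(0))
noise_modulated = film(features, *film_params(noise, w_gamma, w_beta))
```

If a model conditioned on noise matched the accuracy of the model conditioned on the learned state, the gain would be attributable to the extra conditioning pathway rather than to the content of the signal.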

Circularity Check

1 step flagged

CTPC psychological state learned self-supervised from task labels, reducing conditioning to fitted input

specific steps

fitted input called prediction [Abstract]
"analysis of the psychological state signal confirms that it acquires systematic task-label-dependent patterns in a self-supervised manner without explicit psychological annotations."

The psychological state is fitted directly to the task labels (TCR/VCR/DER/DBR) via self-supervision and then injected as a conditioning input to predict those same tasks, making the 'modulatory effect' equivalent to a learned auxiliary signal from the targets rather than an independent psychological prior.

full rationale

The paper's core claim rests on CTPC estimating a psychological state without annotations and using it to condition environmental tasks (TCR/VCR) as well as DER/DBR. However, the abstract explicitly states this signal 'acquires systematic task-label-dependent patterns in a self-supervised manner.' This makes the state a learned function of the downstream labels it is supposed to modulate, rather than an independent causal factor. The Causal Task Chain with prototypes is presented as realizing cognitive dependencies differentiably but is an architectural ansatz without external uniqueness proof. No self-citations or imported uniqueness theorems appear in the provided text. The result shows moderate circularity risk because gains on DER/DBR may stem from joint fitting rather than verified causal structure. Overall derivation is not fully self-contained against the input labels.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on a domain assumption about cognitive hierarchy in driving and introduces one invented entity (the psychological state signal) whose only evidence is performance improvement; no free parameters are explicitly listed in the abstract.

axioms (1)

domain assumption: Driving behavior follows a cognitive causal hierarchy from environmental perception through internal states to behavioral regulation.
The framework is explicitly built on this cognitive-science-grounded structure to justify the task chain and conditioning.

invented entities (1)

Psychological state signal (no independent evidence)
purpose: To estimate driver internal state from facial and posture cues and inject it as conditioning input to all recognition tasks.
The signal is learned without explicit psychological annotations and is claimed to acquire task-label-dependent patterns.
