Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

José M. Buenaposada; Luis Baumela; Pablo Parte; Roberto Valle

arxiv: 2604.06961 · v2 · pith:3QPZ3RBFnew · submitted 2026-04-08 · 💻 cs.CV

Pablo Parte, Roberto Valle, José M. Buenaposada, Luis Baumela

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords facial landmark detection · demographic bias · human-robot interaction · age bias · fairness · confounding factors · computer vision · robot perception

The pith

After controlling for head pose and image resolution, facial landmark detectors show no bias by gender or race but retain an age-related bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits demographic biases in facial landmark detection, a foundational step for robots interpreting human faces in interaction. It applies a statistical approach to isolate the effects of age, gender, and race from other image properties such as head orientation and picture clarity. Once those other factors are held constant, performance gaps between genders and racial groups disappear entirely. A remaining difference tied to age persists, with detectors making larger errors on older faces. This matters because even small inaccuracies at this early stage can lead to unequal outcomes when robots respond to people, particularly affecting older users.

Core claim

The authors establish that confounding visual factors, especially head pose and image resolution, account for most observed variations in facial landmark detection accuracy. After applying controls to remove these influences, disparities across gender and race are no longer statistically detectable. A significant age effect remains, however, with older individuals experiencing systematically higher landmark placement errors. The work concludes that such biases in low-level vision models can carry forward into the broader human-robot interaction pipeline and disproportionately impact vulnerable groups.

What carries the argument

A controlled statistical methodology that disentangles demographic attributes from confounding visual factors such as head pose and image resolution.
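To make the idea of controlling for confounders concrete, here is a minimal sketch on synthetic data. It is not the authors' procedure: the variable names, the linear error model, and every coefficient are assumptions made for illustration. The simulated landmark error depends on yaw, face resolution, and age, but not on the binary group attribute. Because one group is photographed at larger yaw angles, a naive comparison of group means still shows a gap, which shrinks toward zero once the confounders enter the regression.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def simulate(n=4000, seed=0):
    """Synthetic audit data (all numbers are made up for illustration)."""
    rng = random.Random(seed)
    rows, errors = [], []
    for _ in range(n):
        group = rng.randint(0, 1)             # hypothetical binary demographic attribute
        yaw = rng.gauss(15 + 15 * group, 8)   # group 1 is imaged at larger yaw angles
        res = rng.gauss(100, 20)              # face size in pixels
        age = rng.uniform(18, 80)
        # Error depends on yaw, resolution and age, but NOT on group.
        err = 2.0 + 0.05 * yaw - 0.01 * res + 0.02 * age + rng.gauss(0, 0.5)
        rows.append([1.0, group, yaw, res, age])   # intercept, group, confounders, age
        errors.append(err)
    return rows, errors

X, y = simulate()
g1 = [e for row, e in zip(X, y) if row[1] == 1]
g0 = [e for row, e in zip(X, y) if row[1] == 0]
naive_gap = sum(g1) / len(g1) - sum(g0) / len(g0)   # raw difference of group means

beta = ols(X, y)
adjusted_gap, age_slope = beta[1], beta[4]          # coefficients on group and age

print(f"naive group gap:    {naive_gap:.3f}")
print(f"adjusted group gap: {adjusted_gap:.3f}")
print(f"age slope:          {age_slope:.4f}")
```

In this toy setup the age coefficient survives adjustment while the group coefficient does not, which mirrors the pattern the paper reports. A real audit would also need standard errors and significance tests, which this sketch omits.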

If this is right

Fairness in human-robot interaction requires auditing low-level perception components, not only high-level analysis tasks.
Age-specific performance differences in landmark detection can propagate to downstream robot behaviors that affect older users.
Correcting for visual confounders like pose and resolution is necessary to achieve equitable perception systems.
Vulnerable populations may experience reduced robot reliability due to these persistent age-related inaccuracies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training data for landmark detectors should be deliberately balanced by age to reduce the remaining bias.
The auditing method could be extended to other basic computer vision tasks to check for similar hidden demographic effects.
In practice, robots might need supplementary age-aware processing stages to compensate for landmark errors on older faces.

Load-bearing premise

The statistical controls fully separate demographic effects from visual confounders without leaving residual dataset-specific interactions or unmeasured variables that could create or hide biases.

What would settle it

Running the same landmark detector on a fresh dataset where age, gender, race, head pose, and resolution are independently balanced and varied, then checking whether the age-related error gap remains or whether gender and race gaps reappear.
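One way to approximate such a check on existing data is to compare age groups only within cells of similar pose and resolution. The sketch below is an illustration, not the paper's protocol: the tuple layout, the bin widths, and the toy numbers are all assumptions.

```python
from collections import defaultdict

def stratified_gap(samples, yaw_bin=15.0, res_bin=32.0):
    """Average older-minus-younger error gap within (yaw, resolution) cells.

    samples: iterable of (is_older, yaw_deg, face_px, error) tuples.
    Cells that lack either age group are skipped, because no
    within-cell comparison is possible there.
    """
    cells = defaultdict(lambda: {True: [], False: []})
    for is_older, yaw, res, err in samples:
        key = (int(abs(yaw) // yaw_bin), int(res // res_bin))
        cells[key][bool(is_older)].append(err)

    weighted_sum, total = 0.0, 0
    for groups in cells.values():
        old, young = groups[True], groups[False]
        if not old or not young:
            continue
        gap = sum(old) / len(old) - sum(young) / len(young)
        weight = len(old) + len(young)      # weight each cell by its sample count
        weighted_sum += gap * weight
        total += weight
    return weighted_sum / total if total else float("nan")

# Toy data: older faces happen to appear at harder poses and lower resolution.
samples = [
    (False, 5, 120, 3.0), (True, 10, 100, 3.5),                      # frontal, large faces
    (False, 40, 50, 6.0), (True, 35, 60, 6.5), (True, 44, 40, 6.5),  # rotated, small faces
    (True, 70, 30, 9.0),                                             # no younger match: skipped
]
old = [e for o, _, _, e in samples if o]
young = [e for o, _, _, e in samples if not o]
naive = sum(old) / len(old) - sum(young) / len(young)
print(f"naive gap:      {naive:.3f}")
print(f"stratified gap: {stratified_gap(samples):.3f}")
```

In the toy data the naive gap is inflated by the unmatched hard sample, while the within-cell gap isolates the part of the difference that is not explained by pose or resolution. Bin widths trade bias against variance: coarse bins leave residual confounding, fine bins leave many cells empty.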

Figures

Figures reproduced from arXiv: 2604.06961 by José M. Buenaposada, Luis Baumela, Pablo Parte, Roberto Valle.

Original abstract

Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender, and race biases. To this end, we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Our analysis demonstrates that visual confounders, particularly head pose and face resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, performance disparities across gender and race vanish. However, we identify a statistically significant age-related bias, with higher localization errors for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.
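The abstract does not say how localization error is measured. A common choice in the landmark detection literature is the normalized mean error (NME): the average Euclidean distance between predicted and ground-truth landmarks, divided by a face-size reference such as the inter-ocular distance. The sketch below assumes that definition; the function name, the eye indices, and the toy coordinates are illustrative only.

```python
import math

def nme(pred, truth, left_eye_idx, right_eye_idx):
    """Mean landmark distance normalized by the ground-truth inter-ocular distance."""
    norm = math.dist(truth[left_eye_idx], truth[right_eye_idx])
    per_point = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(per_point) / len(per_point) / norm

# Toy 3-point face: two eye centres and a nose tip, in pixel coordinates.
truth = [(30.0, 40.0), (70.0, 40.0), (50.0, 60.0)]
pred = [(33.0, 44.0), (70.0, 40.0), (50.0, 65.0)]
score = nme(pred, truth, left_eye_idx=0, right_eye_idx=1)
print(f"NME: {score:.4f}")
```

Normalizing by a face-size reference makes errors comparable across image scales, but it does not remove the effect of resolution on detector accuracy, which is why resolution still has to be treated as a confounder.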

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The audit shows age bias persisting in landmark detection after controls, while gender and race effects are explained by confounders, but the controls' robustness is the open question.

The paper's central result is that once head pose and image resolution are controlled for, performance gaps by gender and race in facial landmark detection disappear, but a statistically significant age effect remains, with older individuals showing higher localization errors. This points to fairness problems starting early in the vision pipeline for robots.

The new element is the focus on landmark detection itself rather than end tasks like recognition. The authors introduce a controlled statistical methodology to isolate demographic influences from visual confounders, which makes the audit more precise than simple group comparisons. The work is solid in showing the dominance of those confounders and in highlighting the disproportionate impact on older users in human-robot interaction scenarios. It uses a standard model for the evaluation, keeping things concrete.

Where it is softer is in the details of that control method. The stress test raises a fair point: if the regression does not include interaction terms between demographics and the confounders, or if non-linear effects are present, then the vanishing of gender and race effects and the persistence of the age effect could be misleading. The paper would benefit from more transparency on the exact statistical model, dataset characteristics, and any validation of the disentanglement.

This paper is aimed at the intersection of computer vision fairness and robotics. A reader building or evaluating perceptual systems for HRI will get practical insights from the audit approach and the age finding, provided they review the methods carefully. It deserves peer review. The topic is relevant and the empirical nature makes it straightforward to assess, though revisions on the statistical robustness would be expected. I recommend sending it to referees.

Referee Report

1 major / 0 minor

Summary. The paper audits demographic biases (age, gender, race) in facial landmark detection for human-robot interaction. It introduces a controlled statistical methodology to disentangle demographic effects from visual confounders such as head pose and image resolution. Evaluations on a standard model show that confounders outweigh demographics; after controls, gender and race disparities vanish while a statistically significant age bias remains, with higher errors for older individuals. The work argues this demonstrates fairness issues in low-level vision components that can propagate in HRI pipelines.

Significance. If the disentanglement holds, the result is significant for computer vision and HRI: it shows that bias auditing is needed even for foundational perceptual tasks like landmark detection, not just high-level recognition, and provides a concrete methodology plus empirical evidence that age effects persist independently. This could inform design of equitable robot perception systems and highlights vulnerable populations.

major comments (1)

[Methods / Results (controlled statistical methodology)] The controlled statistical methodology (described in the methods and results sections) does not report inclusion of interaction terms (demographics × head pose or demographics × resolution) or tests for non-linearity in the regression. If omitted, the reported vanishing of gender/race effects and retention of the age effect could be artifacts of model misspecification rather than successful isolation of demographic bias, directly affecting the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment on the controlled statistical methodology below and will revise the paper accordingly to strengthen the analysis.

Referee: [Methods / Results (controlled statistical methodology)] The controlled statistical methodology (described in the methods and results sections) does not report inclusion of interaction terms (demographics × head pose or demographics × resolution) or tests for non-linearity in the regression. If omitted, the reported vanishing of gender/race effects and retention of the age effect could be artifacts of model misspecification rather than successful isolation of demographic bias, directly affecting the central claim.

Authors: We appreciate the referee's observation regarding potential model misspecification. Our regression analysis used a linear model with main effects for demographics (age, gender, race) and confounders (head pose, resolution) to isolate demographic contributions after controlling for visual factors. Interaction terms and explicit non-linearity tests were not included in the reported results, as the primary goal was to evaluate whether demographic effects remain significant beyond the main confounding influences. We acknowledge that this could limit the ability to detect moderated effects (e.g., age-specific sensitivity to pose). In the revised manuscript, we will incorporate interaction terms between each demographic variable and the confounders, along with tests for non-linearity (such as quadratic terms for continuous confounders and appropriate diagnostics). We will report these extended results to verify that the vanishing of gender and race effects, as well as the persistent age effect, hold under the more comprehensive specification.

revision: yes
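The "age-specific sensitivity to pose" mentioned in this exchange can be illustrated with a small sketch. An age × pose interaction means the slope of error against yaw differs between age groups, so one simple diagnostic is to fit that slope separately per group. Everything here is an assumption for illustration: the group labels, the slopes, and the noise level are invented, and resolution is left out for brevity.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys on xs (simple linear regression)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def pose_slope_by_group(samples):
    """Slope of error against |yaw|, fitted separately for each group label."""
    by_group = {}
    for group, yaw, err in samples:
        xs, ys = by_group.setdefault(group, ([], []))
        xs.append(abs(yaw))
        ys.append(err)
    return {g: slope(xs, ys) for g, (xs, ys) in by_group.items()}

# Synthetic data with a built-in interaction: error grows faster with yaw
# for the "older" group (0.09 per degree) than for "younger" (0.05 per degree).
rng = random.Random(1)
samples = []
for _ in range(3000):
    group = rng.choice(["younger", "older"])
    yaw = rng.uniform(0, 60)
    per_degree = 0.05 if group == "younger" else 0.09
    samples.append((group, yaw, 2.0 + per_degree * yaw + rng.gauss(0, 0.5)))

slopes = pose_slope_by_group(samples)
for g, s in sorted(slopes.items()):
    print(f"{g:8s} error per degree of yaw: {s:.3f}")
```

For a binary group, the difference between the two fitted slopes equals the coefficient on the group × yaw product term in a single regression that also includes both main effects. A main-effects-only model would instead force a common slope and fold the difference into the group coefficient and the residuals.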

Circularity Check

0 steps flagged

No circularity: empirical audit with independent statistical evaluation

The paper performs an empirical audit of a standard facial landmark detector on demographic subgroups, using regression-based controls for confounders (head pose, resolution). No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Results are obtained by direct evaluation and statistical testing against held-out data rather than by construction from the inputs themselves. The methodology is self-contained and falsifiable via external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard statistical assumptions for controlling confounders and the representativeness of the evaluated model and dataset. No free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Statistical controls for head pose and image resolution can fully isolate demographic effects without residual confounding or interaction terms.
This underpins the claim that gender and race disparities vanish after accounting for confounders while age bias remains.
