Editing Physiological Signals in Videos Using Latent Representations

Akshay Paruchuri; Josef Spjut; Kaan Akşit; Tianwen Zhou

arxiv: 2509.25348 · v3 · submitted 2025-09-29 · 💻 cs.CV · cs.HC · cs.MM

Pith reviewed 2026-05-18 11:59 UTC · model grok-4.3

classification 💻 cs.CV · cs.HC · cs.MM

keywords physiological signal editing · remote photoplethysmography · video latent editing · heart rate modulation · 3D variational autoencoder · privacy protection · FiLM conditioning

The pith

Physiological signals such as heart rate can be edited in facial videos by modulating their latent representations from a 3D VAE while preserving visual quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a framework for editing physiological signals in videos to mitigate privacy risks from visible vital signs. Videos are encoded into a latent space using a pretrained 3D variational autoencoder, and a target heart rate is introduced via a text prompt. These are fused in trainable layers that maintain temporal coherence, with FiLM used in decoding to adjust the signal. The approach demonstrates effective modulation alongside high visual fidelity on tested datasets.

Core claim

The paper establishes that fusing latent video encodings from a 3D VAE with target heart rate prompts through AdaLN-based spatio-temporal layers and FiLM decoding enables accurate physiological editing in reconstructed videos without substantial visual degradation.

What carries the argument

Spatio-temporal fusion layers with Adaptive Layer Normalizations that integrate video latents and HR embeddings before FiLM-based decoding.
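A minimal sketch of the two conditioning operations named above, written in plain Python on a single feature vector. The paper's layer shapes, projection networks, and exact AdaLN variant are not given here, so `condition_to_scale_shift` and its constants are hypothetical stand-ins for a learned projection of the HR embedding.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def film(x, gamma, beta):
    """Feature-wise linear modulation: a per-feature scale and shift."""
    return [g * v + b for v, g, b in zip(x, gamma, beta)]

def adaln(x, gamma, beta):
    """Adaptive layer norm: normalize, then apply a condition-dependent scale and shift."""
    return film(layer_norm(x), gamma, beta)

def condition_to_scale_shift(hr_bpm, dim):
    """Hypothetical stand-in for a learned mapping from the HR condition to (gamma, beta)."""
    beats_per_second = hr_bpm / 60.0
    gamma = [1.0 + 0.1 * beats_per_second for _ in range(dim)]
    beta = [0.01 * beats_per_second * i for i in range(dim)]
    return gamma, beta

# Toy latent vector standing in for one position of the video latent.
latent = [0.2, -1.0, 0.5, 1.3]
gamma, beta = condition_to_scale_shift(90.0, len(latent))
modulated = adaln(latent, gamma, beta)
print(modulated)
```

In both operations the condition only rescales and shifts features; it does not replace them. Whatever pulse information the latent already carries is therefore what the modulation has to work with.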

If this is right

The method supports anonymizing personal health information in shared videos.
It facilitates the creation of videos featuring specific physiological states for research or training.
Reconstructed videos retain high visual quality as measured by PSNR and SSIM metrics.
Modulated heart rates closely match targets according to rPPG estimation errors.
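For reference, three of the metrics named above can be sketched in plain Python as follows. This is an illustration under assumptions: pixel values are taken to be 8-bit (maximum 255), the sample numbers are made up, and SSIM is omitted because it needs windowed statistics. The paper's own evaluation code is not shown here.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)

def mae(targets, estimates):
    """Mean absolute error, in the same unit as the inputs (bpm for heart rate)."""
    return sum(abs(t - e) for t, e in zip(targets, estimates)) / len(targets)

def mape(targets, estimates):
    """Mean absolute percentage error relative to the targets, in percent."""
    ratios = [abs(t - e) / abs(t) for t, e in zip(targets, estimates)]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up values for illustration only.
target_hr = [60, 80, 100]      # prompted heart rates in bpm
estimated_hr = [66, 76, 110]   # what an rPPG estimator might report
print(mae(target_hr, estimated_hr), mape(target_hr, estimated_hr))
print(psnr([100, 120, 140, 160], [101, 119, 141, 159]))
```

MAE and MAPE are computed against the prompted target, while PSNR is computed against the input frames, so the two groups of metrics measure different things.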

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar techniques might apply to editing other time-varying signals like facial expressions if their temporal structure is preserved in the latent space.
This opens possibilities for controlled data augmentation in machine learning models for physiological monitoring.
Validation on a broader range of video sources and conditions would strengthen the findings.

Load-bearing premise

The 3D VAE's latent representations retain the necessary temporal dynamics of physiological signals for successful modulation.

What would settle it

Reconstructing videos with the proposed method and then using an independent rPPG estimator to check if the heart rate matches the input prompt within the reported average errors, or assessing if visual metrics fall below the claimed thresholds.
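The check described above can be sketched on a toy signal. The example assumes a pulse trace has already been extracted from the edited video (for instance, a per-frame mean skin intensity) and substitutes a simple spectral-peak estimator for the learned rPPG estimator. The function name, band limits, and synthetic trace are all illustrative assumptions.

```python
import cmath
import math

def estimate_hr_bpm(signal, fps, lo_hz=0.7, hi_hz=3.0):
    """Return the dominant frequency within [lo_hz, hi_hz], in beats per minute."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    best_freq, best_power = 0.0, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fps / n
        if not lo_hz <= freq <= hi_hz:
            continue  # skip bins outside the plausible heart-rate band
        # Plain DFT coefficient for bin k (no external FFT library needed).
        coeff = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                    for i, c in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return 60.0 * best_freq

# Synthetic 10-second trace at 30 fps carrying a 90 bpm pulse.
fps = 30.0
target_bpm = 90.0
n_frames = 300
trace = [math.sin(2 * math.pi * (target_bpm / 60.0) * i / fps)
         for i in range(n_frames)]

estimated = estimate_hr_bpm(trace, fps)
error_bpm = abs(estimated - target_bpm)
print(estimated, error_bpm)
```

On real video the trace is noisy and the frequency resolution is limited by clip length (here 0.1 Hz, or 6 bpm), so a check like this gives a coarse sanity test, not a substitute for a trained estimator.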

Original abstract

Camera-based physiological signal estimation provides a non-contact and convenient means to monitor Heart Rate (HR). However, the presence of vital signals in facial videos raises significant privacy concerns, as they can reveal sensitive personal information related to the health and emotional states of an individual. To address this, we propose a learned framework that edits physiological signals in videos while preserving visual fidelity. First, we encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE), while a target HR prompt is embedded through a frozen text encoder. We fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) to capture the strong temporal coherence of remote Photoplethysmography (rPPG) signals. We apply Feature-wise Linear Modulation (FiLM) in the decoder with a fine-tuned output layer to avoid the degradation of physiological signals during reconstruction, enabling accurate physiological modulation in the reconstructed video. Empirical results show that our method preserves visual quality with an average PSNR of 38.96 dB and SSIM of 0.98 on selected datasets, while achieving an average HR modulation error of 10.00 bpm MAE and 10.09% MAPE using a state-of-the-art rPPG estimator. Our design's controllable HR editing is useful for applications such as anonymizing biometric signals in real videos or synthesizing realistic videos with desired vital signs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper shows a way to edit heart rate in face videos by fusing text prompts into 3D VAE latents with AdaLN and FiLM layers, but the results may depend on the VAE actually keeping the subtle rPPG timing information.

The main thing to know is that this work encodes input videos with a frozen pretrained 3D VAE, embeds a target heart rate as text, fuses the two through trainable spatio-temporal AdaLN layers, and decodes with FiLM conditioning plus a fine-tuned output layer to change the apparent HR while trying to hold visual quality steady. The reported numbers are concrete: average PSNR of 38.96 dB, SSIM of 0.98, and HR modulation error around 10 bpm MAE and 10% MAPE when checked with a separate rPPG estimator. That combination of text prompting and modulation layers for physiological control is the specific new piece, extending past standard rPPG estimation or generic video editing. The focus on temporal coherence for rPPG signals through those layers is a reasonable design choice for the stated goal of privacy or synthetic data uses.

The soft spot is the load-bearing assumption that the general-purpose 3D VAE latents retain the low-amplitude, periodic skin intensity fluctuations that rPPG needs. If the encoder averages or drops those micro-temporal details, the method could still produce high visual scores by synthesizing plausible faces while the HR prompt simply biases the output appearance, and the downstream estimator registers a change without the video holding the real signal. The paper would be stronger with direct checks on whether rPPG information survives the encoding step and whether the result holds across different estimators. Datasets and full ablation details are not in the abstract but presumably appear in the full text.

This is aimed at CV and HCI people working on remote monitoring privacy or controllable video synthesis. A reader interested in practical biometric editing tools would get something usable from the fusion technique. It deserves peer review because the approach is technically coherent and the empirical claims are specific enough to benefit from referee scrutiny on the latent preservation question.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework for editing heart rate (HR) signals in facial videos to address privacy concerns while preserving visual fidelity. It encodes input videos using a frozen pretrained 3D VAE, embeds target HR prompts via a frozen text encoder, fuses them with trainable spatio-temporal layers incorporating Adaptive Layer Normalization (AdaLN), and decodes using Feature-wise Linear Modulation (FiLM) with a fine-tuned output layer. Empirical results report average PSNR of 38.96 dB, SSIM of 0.98, and HR modulation errors of 10.00 bpm MAE and 10.09% MAPE measured by a state-of-the-art rPPG estimator on selected datasets.

Significance. If the central empirical claims hold under rigorous validation, the work could enable practical applications in biometric anonymization and controlled synthesis of physiological signals for training remote PPG models. The latent-space editing approach with prompt fusion offers a controllable alternative to direct signal manipulation techniques.

major comments (2)

[Method (encoding and fusion pipeline)] The method assumes that the frozen pretrained 3D VAE latent space retains the low-amplitude, periodic skin-intensity fluctuations required for accurate rPPG-based HR modulation (see the encoding and fusion steps in the proposed pipeline). A general-purpose 3D VAE is typically optimized for perceptual reconstruction rather than preserving micro-temporal signals; temporal downsampling or averaging in the encoder could erase these components, causing the reported 10 bpm MAE to reflect decoder bias or estimator artifacts rather than genuine physiological editing. This assumption is load-bearing for the core claim, as all subsequent AdaLN fusion and FiLM decoding operate on the surviving latent information. The manuscript should include targeted analysis (e.g., rPPG signal recovery from latents before/after encoding or comparison against an rPPG-aware encoder) to substantiate preservation.
[Experiments and Results] The evaluation reports quantitative metrics (PSNR, SSIM, MAE/MAPE) but provides insufficient protocol details in the abstract and experimental sections, including exact datasets used, baseline methods, ablation studies on the trainable AdaLN layers and fine-tuned decoder, and independence verification of the rPPG estimator from the editing pipeline. Without these, the cross-dataset claims and the 10.09% MAPE result cannot be fully assessed for robustness or potential circularity.

minor comments (2)

[Experiments] Clarify the exact composition of 'selected datasets' and provide a table summarizing per-dataset metrics rather than averages only.
[Method] The notation for the fusion module (AdaLN spatio-temporal layers) and FiLM conditioning could be formalized with equations to improve reproducibility.
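As an illustration of what such a formalization could look like, the standard forms of the two operations are written out here. This is not the paper's own notation: the symbols $h$ (features), $c$ (HR condition embedding), $\gamma$, $\beta$ (learned functions of $c$), and $\mu$, $\sigma^2$ (feature-wise mean and variance) are assumptions, and some AdaLN variants use $1 + \gamma(c)$ as the scale.

```latex
\begin{align}
\mathrm{FiLM}(h \mid c)  &= \gamma(c) \odot h + \beta(c), \\
\mathrm{AdaLN}(h \mid c) &= \gamma(c) \odot \frac{h - \mu(h)}{\sqrt{\sigma^{2}(h) + \epsilon}} + \beta(c).
\end{align}
```

Written this way, AdaLN is FiLM applied to layer-normalized features, so the two differ only in whether the input statistics are removed before the condition-dependent scale and shift.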

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important aspects of methodological assumptions and experimental rigor that we address below. We have prepared revisions to strengthen the paper accordingly.

Referee: The method assumes that the frozen pretrained 3D VAE latent space retains the low-amplitude, periodic skin-intensity fluctuations required for accurate rPPG-based HR modulation. A general-purpose 3D VAE is typically optimized for perceptual reconstruction rather than preserving micro-temporal signals; temporal downsampling or averaging in the encoder could erase these components, causing the reported 10 bpm MAE to reflect decoder bias or estimator artifacts rather than genuine physiological editing.

Authors: We acknowledge the validity of this concern about signal preservation in a general-purpose 3D VAE. While our empirical results demonstrate effective HR modulation with low error, we agree that explicit verification strengthens the core claim. In the revised manuscript, we will add a targeted analysis subsection comparing rPPG signals recovered from original input videos versus videos reconstructed from the frozen latents (prior to any editing). This will quantify retention of periodic components at HR-relevant frequencies and justify why the VAE's temporal resolution suffices. We will also discuss the VAE's training on diverse video corpora that include facial content. revision: yes
Referee: The evaluation reports quantitative metrics (PSNR, SSIM, MAE/MAPE) but provides insufficient protocol details in the abstract and experimental sections, including exact datasets used, baseline methods, ablation studies on the trainable AdaLN layers and fine-tuned decoder, and independence verification of the rPPG estimator from the editing pipeline. Without these, the cross-dataset claims and the 10.09% MAPE result cannot be fully assessed for robustness or potential circularity.

Authors: We agree that expanded protocol details are essential for reproducibility and to rule out circularity. The revised manuscript will include a dedicated experimental protocol subsection that: (1) enumerates all datasets with references, subject counts, and train/test splits; (2) describes the baseline methods and their implementations; (3) presents ablation studies quantifying the contribution of the spatio-temporal AdaLN fusion layers and the fine-tuned FiLM output layer; and (4) confirms that the evaluation rPPG estimator is a pretrained state-of-the-art model trained on completely disjoint data, ensuring independence from our editing pipeline. These additions will support robust assessment of the reported metrics. revision: yes
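The retention analysis promised in the first response could take a form like the sketch below, which compares the share of spectral power in an HR-relevant band before and after reconstruction. Everything here is an assumption for illustration: the 0.7 to 3.0 Hz band, the function name, and the synthetic "reconstructed" trace, which simply attenuates the pulse component instead of coming from a real 3D VAE.

```python
import cmath
import math

def band_power_fraction(signal, fps, lo_hz=0.7, hi_hz=3.0):
    """Fraction of non-DC spectral power that falls inside [lo_hz, hi_hz]."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    in_band, total = 0.0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                    for i, c in enumerate(centered))
        power = abs(coeff) ** 2
        total += power
        if lo_hz <= k * fps / n <= hi_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0

fps, n_frames = 30.0, 300
pulse = [math.sin(2 * math.pi * 1.2 * i / fps) for i in range(n_frames)]  # 72 bpm
drift = [math.sin(2 * math.pi * 0.2 * i / fps) for i in range(n_frames)]  # slow lighting drift

original = [p + d for p, d in zip(pulse, drift)]
# Hypothetical reconstruction in which the pulse is attenuated to a quarter amplitude.
reconstructed = [0.25 * p + d for p, d in zip(pulse, drift)]

frac_original = band_power_fraction(original, fps)
frac_reconstructed = band_power_fraction(reconstructed, fps)
retention = frac_reconstructed / frac_original
print(frac_original, frac_reconstructed, retention)
```

A retention ratio near 1 would suggest the encoder keeps the pulse band largely intact; a ratio well below 1 would support the referee's concern that later modulation operates on little surviving signal.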

Circularity Check

0 steps flagged

No circularity: pipeline uses external pretrained models and independent evaluation

The paper describes an empirical editing pipeline that encodes input video with a frozen pretrained 3D VAE, embeds an HR prompt via a frozen text encoder, fuses via trainable AdaLN spatio-temporal layers, and applies FiLM-conditioned decoding with a fine-tuned output layer. Reported results rely on external metrics (PSNR, SSIM) and a separate state-of-the-art rPPG estimator for HR modulation error. No equations, self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided description. The central claims rest on independent benchmarks rather than reducing to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach depends on pretrained models and introduces trainable fusion components whose effectiveness is validated empirically rather than derived from first principles.

free parameters (2)

trainable spatio-temporal layers with AdaLN
Parameters of the fusion layers are learned to combine video latents and HR prompts.
fine-tuned output layer in decoder
Additional parameters adjusted to preserve physiological signals during reconstruction.

axioms (2)

domain assumption: Pretrained 3D VAE latent space captures temporal coherence of rPPG signals
Invoked in the encoding step to enable downstream modulation.
domain assumption: Frozen text encoder produces usable embeddings for target HR prompts
Used directly for prompt integration without further training.

pith-pipeline@v0.9.0 · 5798 in / 1505 out tokens · 44791 ms · 2026-05-18T11:59:40.980254+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
encode an input video into a latent space via a pretrained 3D Variational Autoencoder (3D VAE) ... fuse them using a set of trainable spatio-temporal layers with Adaptive Layer Normalizations (AdaLN) ... apply Feature-wise Linear Modulation (FiLM) in the decoder

IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates · unclear
strong temporal coherence of remote Photoplethysmography (rPPG) signals

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography
cs.CV 2026-04 unverdicted novelty 7.0

A new intervention-based SSL paradigm for rPPG uses video editing and falsifiability checks to learn the true physiological signal instead of dominant artifacts.