A neural network based post-filter for speech-driven head motion synthesis

Hiroshi Shimodaira; JinHong Lu

arxiv: 1907.10585 · v2 · pith:I6L7RD7Qnew · submitted 2019-07-24 · 📡 eess.SP · cs.LG · eess.AS

Pith reviewed 2026-05-24 16:39 UTC · model grok-4.3

classification 📡 eess.SP · cs.LG · eess.AS

keywords speech-driven head motion synthesis · neural network post-filter · motion denoising · head pose prediction · animation post-processing · trajectory smoothing

The pith

A neural network trained to reconstruct head motions acts as a post-filter that removes noise and improves smoothness in speech-driven synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that neural networks for speech-driven head motion synthesis produce noisy or discontinuous outputs, requiring post-processing that traditionally trades off smoothness against accuracy. A neural network post-filter is trained to reconstruct clean head motion trajectories from versions corrupted by dropout or Gaussian noise. Objective tests confirm better denoising and smoother joined segments, while analysis shows the filter captures characteristic patterns of head motion. Subjective tests indicate that motions processed by this filter cannot be distinguished from ground truth recordings and are preferred over outputs from Gaussian filters or moving averages.

Core claim

A neural network can be trained to reconstruct head motion sequences so that it removes dropout and Gaussian noise, improves smoothness after motion joining, and learns the statistical characteristics of natural head motions, yielding trajectories that human observers rate as equivalent to ground truth.

What carries the argument

neural network post-filter trained to reconstruct head motions from noisy inputs
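
The training pairs implied by this description can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: the helper names, the dropout rate, the noise scale, and the 50-frame, 3-dimensional window are assumptions chosen for the example, and the dropout here simply zeroes values without rescaling. The loss mirrors the excerpted description of a mean square error normalised by the variance of the ground truth.

```python
import numpy as np

def dropout_corrupt(clean, rate, rng):
    """Zero each value independently with probability `rate` (hypothetical helper)."""
    keep_mask = rng.random(clean.shape) >= rate
    return clean * keep_mask

def gaussian_corrupt(clean, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise with standard deviation `sigma`."""
    return clean + rng.normal(0.0, sigma, size=clean.shape)

def normalised_mse(pred, target):
    """Mean squared error divided by the variance of the ground truth."""
    return np.mean((pred - target) ** 2) / np.var(target)

rng = np.random.default_rng(0)

# Assumed toy window: 50 frames of a 3-dimensional head rotation trajectory.
t = np.linspace(0.0, 1.0, 50)
clean = np.stack(
    [np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), 0.5 * t], axis=1
)

# (corrupted input, clean target) pairs a reconstruction network would be trained on.
noisy_dn = dropout_corrupt(clean, rate=0.2, rng=rng)
noisy_gn = gaussian_corrupt(clean, sigma=0.1, rng=rng)

loss_dn = normalised_mse(noisy_dn, clean)
loss_gn = normalised_mse(noisy_gn, clean)
```

The two loss values give the error of the uncorrupted-versus-corrupted pair itself, which is the baseline a trained filter would have to beat.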

If this is right

The filter handles both dropout and Gaussian noise in objective denoising metrics.
Processed head motion segments show measurable gains in smoothness after concatenation.
The network captures statistical properties specific to head motion rather than applying a generic smoothing rule.
In direct comparison, the filtered motions are rated higher than those from Gaussian filters or moving averages.
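
For context, the two linear baselines named above can be sketched in a few lines of NumPy. The window width, the Gaussian sigma, the synthetic 1 Hz trajectory, and the `roughness` measure are assumptions made for this illustration; the paper's actual filter settings are not given on this page. The last comparison illustrates the smoothness-accuracy trade-off: a very wide window smooths more but drifts further from the clean signal.

```python
import numpy as np

def moving_average(x, width):
    """Moving average with a box kernel; output has the same length as x."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def gaussian_filter(x, sigma):
    """Gaussian smoothing with a kernel truncated at about 3 sigma."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(x, kernel, mode="same")

def roughness(x):
    """Mean squared second difference; lower means smoother (assumed proxy)."""
    return np.mean(np.diff(x, n=2) ** 2)

def mse(a, b):
    return np.mean((a - b) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 200)
clean = np.sin(2 * np.pi * t)            # assumed 1 Hz toy trajectory
noisy = clean + rng.normal(0.0, 0.2, size=t.shape)

mva_narrow = moving_average(noisy, 9)    # moderate smoothing
mva_wide = moving_average(noisy, 101)    # heavy smoothing, flattens the motion
gauss = gaussian_filter(noisy, 3.0)

err_narrow = mse(mva_narrow, clean)
err_wide = mse(mva_wide, clean)
```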

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reconstruction approach could be applied to other body-part motion streams such as hand or torso trajectories generated from speech.
Because the filter learns motion statistics, it might reduce the need for keyframe selection in animation pipelines.
Retraining the filter on motions from a target animation style could adapt the output without changing the upstream speech-to-motion model.

Load-bearing premise

The network trained on specific noise types and data will reconstruct new head motion sequences without adding artifacts or erasing perceptually important motion details.

What would settle it

A blind subjective test on head motion sequences generated from a held-out speaker or with noise types absent from training, in which participants can reliably distinguish the filtered output from the original ground-truth recordings.
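
One way such a test could be scored is an exact two-sided binomial test on the non-neutral A/B votes. The sketch below uses only the standard library; the function name and the vote counts are hypothetical and are not taken from the paper, and dropping neutral votes before testing is an assumption of this example.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p0=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the null preference rate p0.
    """
    probs = [
        comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = probs[successes]
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)

# Hypothetical counts: listeners picking "filtered" over "ground truth".
p_even_split = binomial_two_sided_p(10, 20)   # no detectable preference
p_lopsided = binomial_two_sided_p(18, 20)     # strong preference
```

A large p-value for the filtered-versus-ground-truth pairing would be consistent with participants being unable to tell them apart, although failing to reject the null is weaker evidence than a dedicated equivalence test.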

Figures

Figures reproduced from arXiv: 1907.10585 by Hiroshi Shimodaira, JinHong Lu.

**Figure 1.** The overall framework for our proposed model. A: The regression model predicts head motion frame by frame from the stacked 51 frames of MFCC. B: The auto-encoder for denoising the distinct head motion over 500 ms. '-d': dimension of the data. Gaussian noise or dropout noise is added to the clean data.

**Figure 2.** The distribution of the Y and Z trajectories in head motion in the speaking region for the ground truth and the prediction model before (BF) and after (AF) the filters. Adjacent text (Sec. 4.2, ProposedF vs. linear filters): As mentioned in Sec. 1, the linear filter does not have the additional information of the characteristic of the head motion and filters the noise based on the delayed versions of the input signal.

**Figure 3.** The top diagrams show the effect of the different filters de-noising the output from our prediction model in X, Y, Z trajectories. The square-wave plot in each diagram indicates whether the actor is in speaking (up) or listening (down) mode.

**Figure 5.** The absolute SPARC smoothness values decrease after the filter. It is clear that our proposed filter has a stronger filtration effect than the other two linear filters, as the absolute smoothness values in the three trajectories decrease the most from the predicted result.

Adjacent table (MSE and CCA before and after filtering): before filter, MSE 2.44 and CCA 0.28; after filter, ProposedF MSE 1.97 and CCA 0.33; MVA MSE 2.20 and CCA 0.32; GaussianF MSE 2.15 and CCA 0.33.

**Figure 6.** The percentage preference of A/B test. Adjacent text: Lastly, GaussianF is slightly preferred over moving average. However, the neutral is the highest among the six comparison tests, indicating that participants thought both of them are highly similar.
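
The SPARC (spectral arc length) measure mentioned in the Figure 5 caption can be sketched as below, following the general recipe in the cited movement-smoothness reference: take the magnitude spectrum of a speed profile, normalise it by its maximum, keep the band below a cut-off where the magnitude stays above a threshold, and return the negative arc length of that curve. The 10 Hz cut-off, the 0.05 amplitude threshold, the zero-padding level, and the synthetic speed profiles are assumptions of this example, not settings reported on this page.

```python
import numpy as np

def sparc(speed, fs, pad_level=4, fc=10.0, amp_th=0.05):
    """Spectral arc length of a speed profile; values closer to 0 are smoother."""
    # Zero-padded FFT length for a finely sampled spectrum.
    n_fft = int(2 ** (np.ceil(np.log2(len(speed))) + pad_level))
    freqs = np.arange(n_fft) * fs / n_fft
    mag = np.abs(np.fft.fft(speed, n_fft))
    mag = mag / mag.max()

    # Keep frequencies up to the cut-off.
    in_band = freqs <= fc
    f_sel, m_sel = freqs[in_band], mag[in_band]

    # Trim to the span between the first and last bins above the threshold.
    above = np.nonzero(m_sel >= amp_th)[0]
    f_sel = f_sel[above[0]:above[-1] + 1]
    m_sel = m_sel[above[0]:above[-1] + 1]

    # Arc length of the normalised spectrum, with frequency rescaled to [0, 1].
    df = np.diff(f_sel) / (f_sel[-1] - f_sel[0])
    dm = np.diff(m_sel)
    return -np.sum(np.sqrt(df ** 2 + dm ** 2))

fs = 100.0                                   # assumed frame rate in Hz
t = np.arange(500) / fs
smooth_speed = np.exp(-0.5 * ((t - 2.5) / 0.5) ** 2)
rng = np.random.default_rng(0)
jittery_speed = smooth_speed + rng.normal(0.0, 1.0, size=t.shape)

sparc_smooth = sparc(smooth_speed, fs)
sparc_jittery = sparc(jittery_speed, fs)
```

Under this convention the jittery profile yields a more negative value, so a decrease in the absolute SPARC value after filtering corresponds to smoother motion.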

Original abstract

Despite the fact that neural networks are widely used for speech-driven head motion synthesis, it is well-known that the output of neural networks is noisy or discontinuous due to the limited capability of deep neural networks in predicting human motion. Thus, post-processing is required to obtain smooth head motion trajectories for animation. It is common to apply a linear filter or consider keyframes as post-processing. However, neither approach is optimal as there is always a trade-off between smoothness and accuracy. We propose to employ a neural network trained in a way that it is capable of reconstructing the head motions, in order to overcome this limitation. In the objective evaluation, this filter is proved to be good at de-noising data involving types of noise (dropout or Gaussian noise). Objective metrics also demonstrate the improvement of the joined head motion's smoothness after being processed by our proposed filter. A detailed analysis reveals that our proposed filter learns the characteristic of head motions. The subjective evaluation shows that participants were unable to distinguish the synthesised head motions with our proposed filter from ground truth, which was preferred over the Gaussian filter and moving average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The NN post-filter idea is straightforward but the training on artificial noise does not demonstrate it will clean real synthesis outputs.

The main takeaway is that the authors train a neural network to reconstruct clean head motion trajectories from versions corrupted by dropout or Gaussian noise, then apply it as a post-filter on speech-driven synthesis results. They position this as superior to linear filters because it can learn motion characteristics without forcing a smoothness-accuracy trade-off.

The objective results show solid denoising performance on those noise types and better smoothness when motions are joined. The subjective test is the strongest part: participants could not distinguish the filtered outputs from ground truth and preferred them over Gaussian filtering or moving averages. That gives some practical evidence the approach produces acceptable animation quality under their test conditions. The analysis that the network learns head motion traits is also a reasonable supporting observation.

The soft spot is exactly the one in the stress-test note. Real outputs from speech-driven neural synthesizers tend to have temporally correlated, speech-synchronous deviations rather than the independent noise used in training. The paper reports no direct test of the filter on actual synthesis model outputs, so the claimed benefits for the intended use case rest on an unverified transfer across noise distributions. If that transfer fails, the gains stay limited to the artificial-noise regime.

The work is internally consistent on what it measures, with no circular claims or obvious fitting problems. It is a narrow engineering tweak rather than a conceptual shift. Readers working on speech-driven head animation pipelines might find the method worth trying or extending, especially if they already have similar post-processing needs. It is not broad enough for a general reading group. I would not cite it in my own work.

It still deserves peer review because the empirical setup is clear enough that referees can check the generalization question directly and the subjective results provide a concrete anchor.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a neural network post-filter for speech-driven head motion synthesis outputs. The network is trained to reconstruct clean head motion trajectories from versions corrupted by artificial dropout or Gaussian noise. Objective evaluations claim effective denoising and improved smoothness of joined motions; a detailed analysis is said to show that the filter learns head motion characteristics. Subjective tests report that filtered motions are indistinguishable from ground truth and preferred over Gaussian filtering and moving averages.

Significance. If the central claim holds and the filter generalizes beyond the artificial noise regime, the work could provide a data-driven alternative to linear post-filters that better balances smoothness and fidelity in animation pipelines. The idea of training a reconstruction network on corrupted motion data is a reasonable extension of denoising autoencoders to this domain, but the current evidence does not yet establish transfer to the actual error statistics of speech-driven synthesis models.

major comments (2)

[Abstract] Abstract / Objective evaluation paragraph: the reported denoising and smoothness gains are obtained exclusively on data corrupted by dropout or i.i.d. Gaussian noise. No experiment evaluates the filter on the temporally correlated, speech-synchronous deviations that characterize real outputs of speech-driven synthesis networks; without such a test the intended use-case claim is unsupported.
[Abstract] Abstract / Subjective evaluation paragraph: the claim that participants 'were unable to distinguish the synthesised head motions with our proposed filter from ground truth' and preferred it over baselines rests on an unreported experimental protocol (number of participants, stimuli, forced-choice design, statistical test). This information is load-bearing for the subjective preference result.

minor comments (2)

The manuscript provides no description of network architecture, training hyperparameters, loss function, dataset sizes, or exact objective metrics (e.g., which error norms, smoothness measures).
[Abstract] The statement that 'a detailed analysis reveals that our proposed filter learns the characteristic of head motions' is asserted without reference to any figure, table, or quantitative result supporting the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We respond to each major comment below.

Referee: [Abstract] Abstract / Objective evaluation paragraph: the reported denoising and smoothness gains are obtained exclusively on data corrupted by dropout or i.i.d. Gaussian noise. No experiment evaluates the filter on the temporally correlated, speech-synchronous deviations that characterize real outputs of speech-driven synthesis networks; without such a test the intended use-case claim is unsupported.

Authors: The objective evaluations were performed on data corrupted by dropout and i.i.d. Gaussian noise, as explicitly stated in the abstract and manuscript. These noise types were selected to simulate common artifacts (e.g., frame loss or jitter) in neural predictions of head motion. The network is trained as a reconstruction model to learn motion characteristics from such corruptions. We agree that direct testing on the actual, temporally correlated errors produced by speech-driven synthesis networks would provide stronger evidence for the intended application. We will revise the manuscript to more precisely limit the claims to the evaluated noise regimes and add a discussion of generalization to real synthesis outputs.

revision: partial

Referee: [Abstract] Abstract / Subjective evaluation paragraph: the claim that participants 'were unable to distinguish the synthesised head motions with our proposed filter from ground truth' and preferred it over baselines rests on an unreported experimental protocol (number of participants, stimuli, forced-choice design, statistical test). This information is load-bearing for the subjective preference result.

Authors: The subjective evaluation protocol, including the number of participants, stimuli, forced-choice design, and statistical tests, is described in detail in the body of the manuscript. The abstract summarizes the key outcomes. To address the concern, we will expand the abstract to include the essential protocol details (e.g., participant count and statistical significance) so that the claims are self-contained.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical NN post-filter evaluation

The paper trains a neural network to reconstruct head motion from artificially corrupted inputs (dropout/Gaussian) and reports objective denoising/smoothness metrics plus subjective preference over baselines. No equations, derivations, or predictions are presented that reduce by construction to fitted parameters or inputs. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. Claims rest on standard train/test evaluation against external data and human raters, which is self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on empirical performance of a trained neural network whose internal parameters are fitted to motion data; no explicit free parameters beyond standard NN training are named, and no new entities are postulated.

free parameters (1)

neural network weights and biases
The post-filter is a trained neural network whose parameters are fitted during supervised training on head motion data.

axioms (1)

domain assumption: A neural network can be trained to approximate the mapping from noisy to clean head motion trajectories
The method assumes the reconstruction task is learnable from the available training examples without further justification.
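
The learnability assumption can be illustrated with a deliberately simplified stand-in: a linear map fitted by least squares from noisy windows to clean windows. This is not the authors' neural network, only a toy showing that a reconstruction mapping can be estimated from paired examples when the clean data has low-dimensional structure. The sinusoidal basis, window length, noise level, and sample counts are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy "motion" model: each 50-frame window is a random mix of 4 basis shapes.
frames = 50
t = np.arange(frames) / frames
basis = np.stack([
    np.sin(2 * np.pi * t),
    np.cos(2 * np.pi * t),
    np.sin(4 * np.pi * t),
    np.ones(frames),
])

def sample_windows(n):
    """Draw n clean windows of shape (n, frames)."""
    coeffs = rng.normal(size=(n, basis.shape[0]))
    return coeffs @ basis

def corrupt(clean, sigma=0.5):
    """Add i.i.d. Gaussian noise, standing in for the training-time corruption."""
    return clean + rng.normal(0.0, sigma, size=clean.shape)

# Fit a linear denoiser W so that noisy @ W approximates clean.
clean_train = sample_windows(2000)
noisy_train = corrupt(clean_train)
W, *_ = np.linalg.lstsq(noisy_train, clean_train, rcond=None)

# Evaluate on held-out windows drawn from the same toy distribution.
clean_test = sample_windows(500)
noisy_test = corrupt(clean_test)
denoised = noisy_test @ W

mse_before = np.mean((noisy_test - clean_test) ** 2)
mse_after = np.mean((denoised - clean_test) ** 2)
```

The held-out error drops because the fitted map projects the noise away from the directions the clean windows occupy. Whether the same holds for real head motion and for the errors of a real synthesis model is exactly what the axiom leaves open.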

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

[1] A neural network based post-filter for speech-driven head motion synthesis. Introduction: Predicting human motions using deep neural networks has snowballed and achieved high success in terms of synchronicity between the ground truth and predicted one. Some examples include hand gesture and eye-gaze recognition and generation [1], human motion for long-term prediction [2], and speech-driven head motion prediction [3] [4]. Hu...
[2] Proposed Model: Our proposed model can be represented as a de-noising auto-encoder for smoothing the predicted discontinuous head motions. The objective criterion for the model is the mean square error normalised by the variance of the ground truth for the training. The overall framework of our proposed model is illustrated in Figure 1(B). 2.1. Auto-E...
[3] Experiment, 3.1. Data: We use the University of Edinburgh Speaker Personality and Mocap Dataset [11]. This database contains expressive dialogues between semi-professional actors in extroverted and introverted speaking styles, and the dialogues were non-scripted and spontaneous. There is a total of 13 speakers, with 123 files in the data set, and each fil...
[4] Objective Evaluation: We use four metrics to compare the predicted head motion with the ground truth: the mean-squared error (MSE), the local canonical correlation analysis (CCA) [3] over 500 ms windows ...

Table fragment (columns: Model, DN, GN):
150−300−60, no dropout: DN 0.20, GN 0.70
150−300−60, with dropout: DN 0.22, GN 0.66
150−3000−3000, no dropout: DN 0.18, GN 0.28
150−3000−3000, with dropout: DN 0.19, GN ...
Table note: "with dropout" refers to the addition of a dropout layer before the input layer of the model.
[5] Subjective Evaluation: We conducted A/B preference tests on the naturalness of the synthesised head motion animation. We randomly selected 15 speaking regions in an audio file and split them into five test groups. Each test group has a total of 18 comparison tests comparing between ground truth, head motion filtered by ProposedF, head motion filtered by Gaus...
[6] Conclusions: In this paper, we have studied an effective head motion filter by reconstructing the head movement track in the training stage. We described our data, evaluated the feasibility of our filter model, and compared the filtration effect with common linear filters. From extensive evaluations, we can conclude that (1) an appropriate number in the middle...
[7] K. Stefanov, “Recognition and generation of communicative signals: Modeling of hand gestures, speech activity and eye-gaze in human-machine interaction,” 2018.
[8] P. Ghosh, J. Song, E. Aksan, and O. Hilliges, “Learning human motion models for long-term predictions,” CoRR, vol. abs/1704.02827, 2017. [Online]. Available: http://arxiv.org/abs/1704.02827
[9] K. Haag and H. Shimodaira, “Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis,” in Intelligent Virtual Agents, D. Traum, W. Swartout, P. Khooshabeh, S. Kopp, S. Scherer, and A. Leuski, Eds. Cham: Springer International Publishing, 2016, pp. 198–207.
[10] C. Ding, L. Xie, and P. Zhu, “Head motion synthesis from speech using deep neural networks,” Multimedia Tools and Applications, vol. 74, no. 22, pp. 9871–9888, Nov. 2015. [Online]. Available: https://doi.org/10.1007/s11042-014-2156-2
[11] C. Ding, P. Zhu, and L. Xie, “BLSTM neural networks for speech driven head motion synthesis,” in INTERSPEECH, 2015.
[12] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), vol. 3, June 2000, pp. 1315–1318.
[13] N. Sadoughi and C. Busso, “Novel realizations of speech-driven head movements with generative adversarial networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), 2018.
[14] C. Busso, Z. Deng, M. Grimm, U. Neumann, and S. Narayanan, “Rigid head motion in expressive speech animation: Analysis and synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1075–1086, March 2007.
[15] G. Hofer and H. Shimodaira, “Automatic head motion prediction from speech data,” in INTERSPEECH, 2007.
[16] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Dec. 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1756006.1953039
[17] K. Haag and H. Shimodaira, “The University of Edinburgh speaker personality and mocap dataset,” in FAA, 2015.
[18] “Naturalpoint optitrack.” [Online]. Available: http://www.naturalpoint.com/optitrack
[19] I. Soderkvist and P. Wedin, “Determining the movements of the skeleton using well-configured markers,” Journal of Biomechanics, vol. 26, pp. 1473–7, 01 1994.
[20] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[21] S. Balasubramanian, A. Melendez-Calderon, A. Roby-Brami, and E. Burdet, “On the analysis of movement smoothness,” Journal of NeuroEngineering and Rehabilitation, vol. 12, no. 1, p. 112, Dec. 2015. [Online]. Available: https://doi.org/10.1186/s12984-015-0090-9
