Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

Long-Vu Hoang; Tran Huy Dat; Tuan Nguyen

arXiv: 2506.14148 · v2 · pith:OT4RMU6Enew · submitted 2025-06-17 · 💻 cs.SD · cs.CL · eess.AS

Pith reviewed 2026-05-22 00:13 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · eess.AS

keywords acoustic scattering · hair classification · non-invasive sensing · self-supervised learning · sound classification · deep learning · privacy-preserving AI · material property assessment

The pith

Acoustic scattering from heads carries enough information for self-supervised AI models to classify hair type and moisture at nearly 90 percent accuracy without contact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sound waves scattered off a head encode structural and material details of the hair, allowing AI to distinguish hair types and moisture levels. Several deep learning strategies are compared, with full fine-tuning of a self-supervised model emerging as the strongest performer at close to 90 percent accuracy. This approach uses only acoustic stimuli and recordings, avoiding cameras or physical touch. The authors position it as a privacy-friendly sensing technique that could apply to material classification tasks beyond hair. If the signals truly distinguish the tested properties, the method supplies a non-visual route to object assessment.

Core claim

When an incident acoustic wave interacts with a head-with-hair-sample object, the resulting scattered field encodes structural and material properties that support classification of hair type and moisture. Benchmarking of fully supervised deep learning, embedding-based methods, supervised foundation model fine-tuning, and self-supervised model fine-tuning shows that fine-tuning all parameters of the self-supervised model reaches nearly 90 percent accuracy, establishing acoustic scattering as a privacy-preserving, non-contact alternative to visual classification with broad industrial potential.

What carries the argument

Scattered acoustic signals from head-with-hair objects, fed into deep learning classifiers; the strongest is a model pre-trained with self-supervision whose parameters are then all fine-tuned for the classification task.

If this is right

Acoustic scattering supplies a non-contact route to material classification that avoids capturing visual data.
Full fine-tuning of self-supervised models outperforms supervised and embedding-based alternatives on this acoustic task.
The approach extends naturally to other non-invasive assessments of object properties in settings where privacy or contact-free operation matters.
Nearly 90 percent accuracy demonstrates that scattered sound alone can support practical hair type and moisture discrimination.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scattering-plus-AI pipeline could be tested on other soft or fibrous materials to check whether acoustic signatures generalize beyond hair.
Real-time versions might integrate into devices for continuous monitoring without requiring the subject to remain still.
Pairing acoustic results with minimal additional non-visual cues could raise robustness when background noise or head movement varies.

Load-bearing premise

The scattered acoustic signals from head-with-hair-sample objects contain sufficient distinguishable features to support reliable classification of hair type and moisture across the tested conditions and subjects.

What would settle it

New recordings from fresh subjects or hair conditions where classification accuracy falls substantially below 80 percent would show that the signals lack enough distinguishing features.
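One way to make "substantially below 80 percent" concrete is to put a confidence interval around the measured accuracy and check whether even its upper bound stays under the threshold. The sketch below uses a Wilson score interval. The function names, the roughly 95% level, and the example counts (140 correct out of 200) are illustrative assumptions, not figures from the paper.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p_hat = correct / total
    z2 = z * z
    denom = 1.0 + z2 / total
    centre = (p_hat + z2 / (2.0 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / total + z2 / (4.0 * total * total)) / denom
    return centre - half_width, centre + half_width


def falls_below(correct: int, total: int, threshold: float = 0.80) -> bool:
    """True when even the upper bound of the interval sits under the threshold."""
    _, upper = wilson_interval(correct, total)
    return upper < threshold


# Hypothetical counts for a fresh-subject test set (not results from the paper).
low, high = wilson_interval(140, 200)
settled = falls_below(140, 200)
```

With a small test set the interval widens, so the same observed accuracy may no longer count as clearly below the threshold.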

Figures

Figures reproduced from arXiv: 2506.14148 by Long-Vu Hoang, Tran Huy Dat, Tuan Nguyen.

**Figure 2.** The recording settings.

**Figure 4.** An example of spectrograms of ESS stimulus and the corresponding hair-on-head scattered record.

**Figure 5.** Wav2Vec2-Conformer fine-tuning strategies.

**Figure 6.** Receiver operating characteristic curve one-versus-rest on Task 1.

**Figure 7.** Receiver operating characteristic curve one-versus-rest on Task 2.
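For readers unfamiliar with the one-versus-rest ROC curves named in Figures 6 and 7, the following is a minimal sketch of how a per-class area under the curve can be computed: each class in turn is treated as positive and all others as the rest. The function names and the three-class example scores are hypothetical, chosen only for illustration; the paper's own evaluation code is not shown on this page.

```python
from typing import Dict, List, Sequence


def binary_auc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs: List[List[float]], labels: List[int]) -> Dict[int, float]:
    """Per-class AUC: class k is 'positive', every other class is 'rest'."""
    n_classes = len(probs[0])
    return {
        k: binary_auc([row[k] for row in probs], [y == k for y in labels])
        for k in range(n_classes)
    }


# Hypothetical three-class scores, purely illustrative (not data from the paper).
example_probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.5, 0.3],
]
example_labels = [0, 0, 1, 1, 2, 2]
aucs = one_vs_rest_auc(example_probs, example_labels)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based formula would be the usual choice for larger sets.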

Original abstract

This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries.
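The acoustic stimulus is identified elsewhere on this page as an ESS (exponential sine sweep), played for 5 seconds per the extracted dataset description. The sketch below generates such a sweep from the standard exponential-sweep phase formula. The start and end frequencies (100 Hz and 8 kHz) and the 16 kHz sample rate are assumptions made for illustration, since the page does not state the frequency range used.

```python
import math
from typing import List


def exponential_sine_sweep(f_start: float, f_end: float,
                           duration_s: float, sample_rate: int) -> List[float]:
    """Samples of a sine whose frequency rises exponentially from f_start to f_end."""
    n_samples = int(round(duration_s * sample_rate))
    log_ratio = math.log(f_end / f_start)
    # Phase: phi(t) = 2*pi*f_start*T/ln(f_end/f_start) * (exp(t*ln(f_end/f_start)/T) - 1)
    k = 2.0 * math.pi * f_start * duration_s / log_ratio
    return [
        math.sin(k * (math.exp((i / sample_rate) * log_ratio / duration_s) - 1.0))
        for i in range(n_samples)
    ]


def instantaneous_frequency(f_start: float, f_end: float,
                            duration_s: float, t: float) -> float:
    """Frequency of the sweep at time t (the derivative of the phase over 2*pi)."""
    return f_start * (f_end / f_start) ** (t / duration_s)


# Assumed parameters: only the 5-second duration comes from the source page.
sweep = exponential_sine_sweep(100.0, 8000.0, 5.0, 16000)
```

An exponential sweep spends equal time in each octave, which is one reason it is commonly used for impulse-response style measurements.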

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Acoustic scattering plus self-supervised fine-tuning reaches nearly 90% on hair type and moisture, but the experimental controls look insufficient to rule out subject confounds.

The core result is that fine-tuning all parameters of a self-supervised model on acoustic scattering data from heads gives close to 90% accuracy for classifying hair type and moisture. They benchmark this against fully supervised deep learning, embedding-based methods, and supervised fine-tuning, with the self-supervised approach coming out ahead.

This is a straightforward case study applying acoustic signals as a non-contact way to assess hair properties. It positions the method as privacy-friendly compared to cameras. The benchmarking of different training strategies is useful and shows that leveraging self-supervised pretraining helps here. What stands out as new is the specific application to hair assessment using scattered acoustics with modern foundation model techniques. Prior work on acoustic scattering exists, but this combination for this task appears fresh based on the abstract.

The soft spots center on the experimental setup. No information appears on dataset size, number of subjects, recording protocol, or validation splits. This makes the stress-test concern about learning head geometry or identity instead of hair features a real issue. If recordings come from a limited set of heads without proper cross-validation or controls, the accuracy may not reflect true hair-specific features. The abstract alone does not resolve this.

This kind of paper suits readers working on non-invasive sensing or audio ML applications in consumer or health contexts. It offers a concrete example rather than a broad theoretical advance. I think it deserves peer review to clarify the methods and test whether the results hold under stricter controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces acoustic scattering as a non-invasive, privacy-preserving method for object classification, with a case study on distinguishing hair type and moisture content. Acoustic stimuli are emitted toward head-with-hair-sample objects; the resulting scattered signals are processed by deep-learning pipelines that include fully supervised models, embedding-based classifiers, supervised foundation-model fine-tuning, and self-supervised model fine-tuning. The headline empirical result is that full-parameter fine-tuning of a self-supervised model yields nearly 90% classification accuracy.

Significance. If the reported accuracy is shown to arise from hair-specific scattering features rather than subject identity or head geometry, the work would demonstrate a viable acoustic alternative to visual or contact-based sensing. This could open applications in cosmetics, dermatology, and non-contact material inspection while avoiding image-based privacy risks. The benchmarking of multiple training regimes is a useful contribution, but its impact hinges on rigorous validation that the current description does not yet supply.

major comments (2)

[Abstract / Methods] The manuscript reports a headline accuracy of nearly 90% but provides no information on dataset size, number of subjects, recording protocol, exclusion criteria, or validation strategy (e.g., leave-one-subject-out or subject-independent splits). This information is load-bearing for the central claim that the classifier distinguishes hair properties rather than co-varying anatomical or positional cues.
[Results] Without explicit controls such as bald-head baselines or fixed-head-position recordings, it remains possible that the model exploits consistent head geometry or subject identity instead of hair-specific scattering features. The abstract's phrasing “head-with-hair-sample objects” does not rule out this confound.
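The subject-independent validation the referee asks for can be sketched as a leave-one-group-out split, where every recording from one head is held out together so the model is never tested on a head it has seen. The function name and the `head_ids` values below are hypothetical; the page does not say how the authors actually partitioned their data.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices), one fold per group."""
    # dict.fromkeys keeps first-appearance order while removing duplicates.
    for held_out in dict.fromkeys(groups):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical recording-to-head assignment: two recordings per head.
head_ids = ["head_A", "head_A", "head_B", "head_B", "head_C", "head_C"]
folds = list(leave_one_group_out(head_ids))

for held_out, train_idx, test_idx in folds:
    # No head appears on both sides of a fold.
    assert not {head_ids[i] for i in train_idx} & {head_ids[i] for i in test_idx}
```

A random per-recording split would instead let the same head appear in both training and test sets, which is the leakage the referee's comment is concerned with.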

minor comments (1)

[Abstract] The abstract would benefit from a brief statement of the number of classes (hair types and moisture levels) and the acoustic frequency range used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions have been made to the manuscript.

Referee: [Abstract / Methods] The manuscript reports a headline accuracy of nearly 90% but provides no information on dataset size, number of subjects, recording protocol, exclusion criteria, or validation strategy (e.g., leave-one-subject-out or subject-independent splits). This information is load-bearing for the central claim that the classifier distinguishes hair properties rather than co-varying anatomical or positional cues.

Authors: We agree that these details are critical for supporting the central claim. The original manuscript did not include a sufficiently detailed account of the experimental protocol. In the revised version we have expanded the Methods section with a complete description of dataset size, number of subjects, recording protocol, exclusion criteria, and the subject-independent validation strategy used to reduce the chance that the model learns identity or geometry cues instead of hair properties.

revision: yes

Referee: [Results] Without explicit controls such as bald-head baselines or fixed-head-position recordings, it remains possible that the model exploits consistent head geometry or subject identity instead of hair-specific scattering features. The abstract's phrasing “head-with-hair-sample objects” does not rule out this confound.

Authors: We recognize the validity of this concern. While the study used multiple subjects and introduced positional variability during recordings, we did not collect bald-head baseline data. The revised manuscript now includes an explicit discussion of this potential confound, clarifies the rationale for the self-supervised fine-tuning approach, and acknowledges the absence of such controls as a limitation that future work should address.

revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical classification pipeline

The paper reports classification accuracies obtained by training and fine-tuning deep learning models (supervised, embedding-based, and self-supervised) on collected acoustic scattering signals from head-with-hair objects. No equations, derivations, or parameter fittings are presented that reduce the reported ~90% accuracy to an input by construction. The central result is an empirical performance number on held-out data rather than a self-referential prediction or uniqueness claim. Self-citations are absent from the provided text, and the approach relies on standard ML training protocols without load-bearing imported theorems or ansatzes. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that scattered acoustic fields encode usable structural and material information; no free parameters or invented entities are introduced beyond standard deep-learning training.

axioms (1)

domain assumption: Scattered acoustic fields encode structural and material properties of objects.
Invoked in the second sentence of the abstract as the physical basis for the classification task.

pith-pipeline@v0.9.0 · 5677 in / 1164 out tokens · 47387 ms · 2026-05-22T00:13:53.889843+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

[1] Introduction. Classifying complex objects has extensive applications in various industries from industrial inspection, and healthcare to security systems. Traditional classification methods predominantly rely on visual data, leveraging deep learning models trained on images and videos [1, 2]. While these approaches achieve high accuracy, they are often...

[2] Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment (internal anchor; arXiv, 2025). Acoustic scattering in hair type assessment. In this section, we describe our setup to perform hair moisture assessment using acoustic waves. The block diagram of arXiv:2506.14148v1 [cs.SD] 17 Jun 2025 Figure 1: Schematic diagram of the experiment. Figure 2: The recording settings. Figure 3: Pictures of the dummy mannequin heads used in the study, from l...

[3] Sound classification methods. This section introduces deep learning approaches for the classification of hair contents based on scattered acoustic waves. Considering that this direction is novel, we borrow the well-known and well-studied techniques from a proximate problem: sound classification. Specifically, we investigate common and potential solution...

[4] with GridSearch parameter optimisation to mitigate overfitting. Performance was benchmarked on our datasets by extracting embedding vectors from the AudioSet VGGish pre-trained model [17] and fitting them to an XGBoost classifier. Figure 4: An example of spectrograms of ESS stimulus and the corresponding hair-on-head scattered record. Figure 5: Wav2V...

[5] have proven to be effective in speech and sound classification problems [24, 25, 26, 27, 28]. In this paper, to apply SSL to audio classification, we utilised the Wav2Vec2-Conformer large

[6] model with rotary position embeddings. The model was pre-trained for 960 hours on Librispeech. This SSL model performed better than others in our preliminary experiments. We keep the model hyperparameters the same as the pre-trained configuration, with a total number of parameters of 593.6M. Besides, we used a pre-trained voice activity detector [30] to...

[7] Experimental results. 4.1. Dataset and Evaluation metrics. The dataset was constructed through multiple recording rounds, each comprising two weekly sessions separated by at least one day to ensure independence and identical distribution. Prior to each experiment, the hair on dummy heads is untangled and combed. Each session involved playing a 5-second ESS ...

[8] Conclusions. In this study, we explored acoustic scattering as a novel technique for non-invasive object classification, with a specific case study on hair moisture and type classification. From the scattered acoustic signals, we evaluated multiple sound classification methods, including embedding-based models, fully supervised deep learning, foundatio...

[9] Acknowledgements. This work was partially supported by the P&G-A*STAR Joint Collaboration Fund (project number C212316006)
[10] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3367–3375.

[11] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM Computing Surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022.

[12] L. E. Kinsler, A. R. Frey, A. B. Coppens, and J. V. Sanders, Fundamentals of Acoustics. John Wiley & Sons, 2000.

[13] V. Welsby, “Scattering phenomena in acoustic wave propagation,” Journal of Sound and Vibration, vol. 8, no. 1, pp. 64–96, 1968.

[14] S. Kiminki et al., “Sound propagation theory for linear ray acoustic modelling,” Ph.D. dissertation, Helsinki University of Technology, 2005.

[15] M. Palmese and A. Trucco, “Acoustic imaging of underwater embedded objects: Signal simulation for three-dimensional sonar instrumentation,” IEEE Transactions on Instrumentation and Measurement, vol. 55, no. 4, pp. 1339–1347, 2006.

[16] W. J. Staszewski, “Structural health monitoring using guided ultrasonic waves,” in Advances in Smart Technologies in Structural Engineering. Springer, 2004, pp. 117–162.

[17] T. L. Szabo, Diagnostic Ultrasound Imaging: Inside Out. Academic Press, 2013.

[18] L. Gavrilov, E. Tsirulnikov, and I. a. I. Davies, “Application of focused ultrasound for the stimulation of neural structures,” Ultrasound in Medicine & Biology, vol. 22, no. 2, pp. 179–192, 1996.

[19] M. Dmitrieva, M. Valdenegro-Toro, K. Brown, G. Heald, and D. Lane, “Object classification with convolution neural network based on the time-frequency representation of their echo,” in 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2017, pp. 1–6.

[20] A. Farina, “Advancements in impulse response measurements by sine sweeps,” in Audio Engineering Society Convention 122. Audio Engineering Society, 2007.
[21] M. Garai, P. Guidorzi et al., “Optimizing the exponential sine sweep (ESS) signal for in situ measurements on noise barriers,” in Proceedings Euronoise. European Acoustics Association (EAA), 2015, pp. 57–62.

[22] A. Chaturvedi, S. A. Yadav, H. M. Salman, H. R. Goyal, H. Gebregziabher, and A. K. Rao, “Classification of sound using convolutional neural networks,” in 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), 2022, pp. 1015–1019.

[23] B. Dave and K. Srivastava, “Convolutional neural networks for audio classification: An ensemble approach,” in Proceedings of the 6th International Conference on Advance Computing and Intelligent Engineering: ICACIE 2021. Springer, 2022, pp. 253–262.

[24] C. Alippi, S. Disabato, and M. Roveri, “Moving convolutional neural networks to embedded systems: the AlexNet and VGG-16 case,” in 2018 17th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). IEEE, 2018, pp. 212–223.

[25] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.

[26] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “CNN architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.

[27] Y. Gong, Y.-A. Chung, and J. Glass, “AST: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021.

[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.

[29] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.

[30] S. Doh, K. Choi, J. Lee, and J. Nam, “LP-MusicCaps: LLM-based pseudo music captioning,” arXiv preprint arXiv:2307.16372, 2023.

[31] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
[32] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12 449–12 460, 2020.

[33] C. Y. Kwok, D.-T. Truong, and J. Q. Yip, “Robust audio deepfake detection using ensemble confidence calibration,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5.

[34] T. Liu, D.-T. Truong, R. K. Das, K. A. Lee, and H. Li, “Nes2Net: A lightweight nested architecture for foundation model driven speech anti-spoofing,” arXiv preprint arXiv:2504.05657, 2025.

[35] T. Gupta, T. D. Truong, T. T. Anh, and E. S. Chng, “Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model,” in Interspeech 2022, 2022, pp. 1978–1982.

[36] D.-T. Truong, T. T. Anh, and C. E. Siong, “Exploring speaker age estimation on different self-supervised learning models,” in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022, pp. 1950–1955.

[37] D.-T. Truong, R. Tao, T. Nguyen, H.-T. Luong, K. A. Lee, and E. S. Chng, “Temporal-channel modeling in multi-head self-attention for synthetic speech detection,” in Interspeech 2024, 2024, pp. 537–541.

[38] C. Wang, Y. Tang, X. Ma, A. Wu, S. Popuri, D. Okhonko, and J. Pino, “Fairseq S2T: Fast speech-to-text modeling with fairseq,” arXiv preprint arXiv:2010.05171, 2020.

[39] S. Team, “Silero models: pre-trained enterprise-grade STT / TTS models and benchmarks,” https://github.com/snakers4/silero-models, 2021.

[40] Y. Wang, A. Boumadane, and A. Heba, “A fine-tuned wav2vec 2.0/HuBERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,” arXiv preprint arXiv:2111.02735, 2021.

[41] P. Gimeno, V. Mingote, A. Ortega, A. Miguel, and E. Lleida, “Generalizing AUC optimization to multiclass classification for audio segmentation with limited training data,” IEEE Signal Processing Letters, vol. 28, pp. 1135–1139, 2021.

[42] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Metho...

[1] [1]

Traditional classification methods predom- inantly rely on visual data, leveraging deep learning models trained on images and videos [1, 2]

Introduction Classifying complex objects has extensive applications in var- ious industries from industrial inspection, and healthcare to security systems. Traditional classification methods predom- inantly rely on visual data, leveraging deep learning models trained on images and videos [1, 2]. While these approaches achieve high accuracy, they are often...

work page

[2] [2]

Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

Acoustic scattering in hair type assessment In this section, we describe our setup to perform hair mois- ture assessment using acoustic waves. The block diagram of arXiv:2506.14148v1 [cs.SD] 17 Jun 2025 Figure 1: Schematic diagram of the experiment. Figure 2: The recording settings. Figure 3: Pictures of the dummy mannequin heads used in the study, from l...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Considering that this direction is novel, we borrow the well- known and well-studied techniques from a proximate problem: sound classification

Sound classification methods This section introduces deep learning approaches for the clas- sification of hair contents based on scattered acoustic waves. Considering that this direction is novel, we borrow the well- known and well-studied techniques from a proximate problem: sound classification. Specifically, we investigate common and potential solution...

work page

[4] [4]

Performance was benchmarked on our datasets by ex- tracting embedding vectors from the AudioSet VGGish pre- trained model [17] and fitting them to an XGBoost classifier

with GridSearch parameter optimisation to mitigate over- fitting. Performance was benchmarked on our datasets by ex- tracting embedding vectors from the AudioSet VGGish pre- trained model [17] and fitting them to an XGBoost classifier. Figure 4: An example of spectrograms of ESS stimulus and the corresponding hair-on-head scattered record. Figure 5: Wav2V...

work page

[5] [5]

In this paper, to apply SSL to audio classification, we utilised the Wav2Vec2-Conformer large

have proven to be effective in speech and sound classifica- tion problems [24, 25, 26, 27, 28]. In this paper, to apply SSL to audio classification, we utilised the Wav2Vec2-Conformer large

work page

[6] [6]

The model was pre-trained for 960 hours on Librispeech

model with rotary position embeddings. The model was pre-trained for 960 hours on Librispeech. This SSL model per- formed better than others in our preliminary experiments. We keep the model hyperparameters the same as the pre-trained configuration, with a total number of parameters of 593.6M. Besides, we used a pre-trained voice activity detector [30] to...

work page

Experimental results

4.1. Dataset and Evaluation metrics

The dataset was constructed through multiple recording rounds, each comprising two weekly sessions separated by at least one day so that the recordings can be treated as independent and identically distributed. Prior to each experiment, the hair on the dummy heads was untangled and combed. Each session involved playing a 5-second ESS ...
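The exponential sine sweep (ESS) stimulus mentioned above can be generated from its standard closed form, x(t) = sin(2π f₁ L (e^{t/L} − 1)) with L = T / ln(f₂/f₁). The sketch below is illustrative only. The 5-second duration comes from the text, while the 20 Hz to 20 kHz range, the 48 kHz sample rate, and the function names are assumptions, since the excerpt does not state them.

```python
import math


def exponential_sine_sweep(f_start, f_end, duration, sample_rate):
    """Return the samples of an exponential sine sweep from f_start to f_end (Hz)."""
    if not 0 < f_start < f_end:
        raise ValueError("need 0 < f_start < f_end")
    # L controls how quickly the instantaneous frequency grows.
    growth = duration / math.log(f_end / f_start)
    n_samples = int(round(duration * sample_rate))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        phase = 2.0 * math.pi * f_start * growth * (math.exp(t / growth) - 1.0)
        samples.append(math.sin(phase))
    return samples


def instantaneous_frequency(f_start, f_end, duration, t):
    """Frequency of the sweep at time t: f_start * (f_end / f_start) ** (t / duration)."""
    return f_start * (f_end / f_start) ** (t / duration)


# Assumed parameters: 20 Hz to 20 kHz at 48 kHz; only the 5 s duration is from the text.
sweep = exponential_sine_sweep(20.0, 20000.0, 5.0, 48000)
print(len(sweep), instantaneous_frequency(20.0, 20000.0, 5.0, 2.5))
```

Because the frequency rises exponentially, the sweep spends equal time in each octave, which is one reason this stimulus is widely used for impulse response measurements.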

Conclusions

In this study, we explored acoustic scattering as a novel technique for non-invasive object classification, with a specific case study on hair moisture and type classification. From the scattered acoustic signals, we evaluated multiple sound classification methods, including embedding-based models, fully supervised deep learning, foundatio...

Acknowledgements

This work was partially supported by the P&G-A*STAR Joint Collaboration Fund (project number C212316006).

[10] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3367–3375

[11] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM computing surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022

[12] L. E. Kinsler, A. R. Frey, A. B. Coppens, and J. V. Sanders, Fundamentals of acoustics. John Wiley & Sons, 2000

[13] V. Welsby, “Scattering phenomena in acoustic wave propagation,” Journal of Sound and Vibration, vol. 8, no. 1, pp. 64–96, 1968

[14] S. Kiminki et al., “Sound propagation theory for linear ray acoustic modelling,” Ph.D. dissertation, Helsinki University of Technology, 2005

[15] M. Palmese and A. Trucco, “Acoustic imaging of underwater embedded objects: Signal simulation for three-dimensional sonar instrumentation,” IEEE transactions on instrumentation and measurement, vol. 55, no. 4, pp. 1339–1347, 2006

[16] W. J. Staszewski, “Structural health monitoring using guided ultrasonic waves,” in Advances in smart technologies in structural engineering. Springer, 2004, pp. 117–162

[17] T. L. Szabo, Diagnostic ultrasound imaging: inside out. Academic Press, 2013

[18] L. Gavrilov, E. Tsirulnikov, and I. a. I. Davies, “Application of focused ultrasound for the stimulation of neural structures,” Ultrasound in medicine & biology, vol. 22, no. 2, pp. 179–192, 1996

[19] M. Dmitrieva, M. Valdenegro-Toro, K. Brown, G. Heald, and D. Lane, “Object classification with convolution neural network based on the time-frequency representation of their echo,” in 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2017, pp. 1–6

[20] A. Farina, “Advancements in impulse response measurements by sine sweeps,” in Audio engineering society convention 122. Audio Engineering Society, 2007

[21] M. Garai, P. Guidorzi et al., “Optimizing the exponential sine sweep (ess) signal for in situ measurements on noise barriers,” in PROCEEDINGS EURONOISE. European Acoustics Association (EAA), 2015, pp. 57–62

[22] A. Chaturvedi, S. A. Yadav, H. M. Salman, H. R. Goyal, H. Gebregziabher, and A. K. Rao, “Classification of sound using convolutional neural networks,” in 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), 2022, pp. 1015–1019

[23] B. Dave and K. Srivastava, “Convolutional neural networks for audio classification: An ensemble approach,” in Proceedings of the 6th International Conference on Advance Computing and Intelligent Engineering: ICACIE 2021. Springer, 2022, pp. 253–262

[24] C. Alippi, S. Disabato, and M. Roveri, “Moving convolutional neural networks to embedded systems: the alexnet and vgg-16 case,” in 2018 17th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). IEEE, 2018, pp. 212–223

[25] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794

[26] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “Cnn architectures for large-scale audio classification,” in 2017 ieee international conference on acoustics, speech and signal processing (icassp). IEEE, 2017, pp. 131–135

[27] Y. Gong, Y.-A. Chung, and J. Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021

[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255

[29] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780

[30] S. Doh, K. Choi, J. Lee, and J. Nam, “Lp-musiccaps: Llm-based pseudo music captioning,” arXiv preprint arXiv:2307.16372, 2023

[31] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021

[32] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020

[33] C. Y. Kwok, D.-T. Truong, and J. Q. Yip, “Robust audio deepfake detection using ensemble confidence calibration,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

[34] T. Liu, D.-T. Truong, R. K. Das, K. A. Lee, and H. Li, “Nes2net: A lightweight nested architecture for foundation model driven speech anti-spoofing,” arXiv preprint arXiv:2504.05657, 2025

[35] T. Gupta, T. D. Truong, T. T. Anh, and E. S. Chng, “Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model,” in Interspeech 2022, 2022, pp. 1978–1982

[36] D.-T. Truong, T. T. Anh, and C. E. Siong, “Exploring speaker age estimation on different self-supervised learning models,” in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022, pp. 1950–1955

[37] D.-T. Truong, R. Tao, T. Nguyen, H.-T. Luong, K. A. Lee, and E. S. Chng, “Temporal-channel modeling in multi-head self-attention for synthetic speech detection,” in Interspeech 2024, 2024, pp. 537–541

[38] C. Wang, Y. Tang, X. Ma, A. Wu, S. Popuri, D. Okhonko, and J. Pino, “Fairseq s2t: Fast speech-to-text modeling with fairseq,” arXiv preprint arXiv:2010.05171, 2020

[39] S. Team, “Silero models: pre-trained enterprise-grade stt / tts models and benchmarks,” https://github.com/snakers4/silero-models, 2021

[40] Y. Wang, A. Boumadane, and A. Heba, “A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding,” arXiv preprint arXiv:2111.02735, 2021

[41] P. Gimeno, V. Mingote, A. Ortega, A. Miguel, and E. Lleida, “Generalizing auc optimization to multiclass classification for audio segmentation with limited training data,” IEEE Signal Processing Letters, vol. 28, pp. 1135–1139, 2021

[42] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Metho...