Phonetic characteristics of spontaneous speech in a total laryngectomized Italian speaker: Perspectives for speech enhancement algorithms

Chiara Meluzzi; Sonia Cenceschi; Francesco Roberto Dani; Alessandro Trivilini

doi:10.1344/efe-2022-31-45-58

Authors

Chiara Meluzzi Università degli Studi di Milano https://orcid.org/0000-0002-2291-006X
Sonia Cenceschi Scuola Universitaria Professionale della Svizzera Italiana https://orcid.org/0000-0002-4145-9593
Francesco Roberto Dani Scuola Universitaria Professionale della Svizzera Italiana https://orcid.org/0000-0001-9768-5592
Alessandro Trivilini Scuola Universitaria Professionale della Svizzera Italiana https://orcid.org/0000-0003-0687-801X

DOI:

https://doi.org/10.1344/efe-2022-31-45-58

Keywords:

Oesophageal Speech, Voice quality, Voice disorders, Vowel formants, Clinical phonetics

Abstract

This paper describes the main phonetic features of a 74-year-old L1 Italian speaker (ESO01) who underwent a total laryngectomy in 2015, with complete removal of the vocal folds, due to five tumour masses. We offer an acoustic analysis of this target speaker's spontaneous speech in order to lay the groundwork for the development of spontaneous speech enhancement and reconstruction algorithms for non-invasive aids. A semi-automatic analysis extracts F0 and formant values (F1, F2, F3) at the vowel midpoint and at 7 time points, together with other acoustic cues. Our results show that the target speaker has a low and rough voice, but that his vowels are clearly differentiated. Furthermore, we find the acoustic characteristics of vocoids and air releases to be extremely consistent during oesophageal phonation.
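As a rough illustration of the sampling scheme mentioned in the abstract (one midpoint plus 7 time points per vowel), the sketch below computes the measurement times for a single segmented vowel. The abstract does not say how the 7 points are spaced; equal spacing inside the vowel is an assumption made here, and the function name `sample_times` and its parameters are hypothetical, not taken from the authors' scripts.

```python
from typing import List, Tuple


def sample_times(start: float, end: float, n_points: int = 7) -> Tuple[float, List[float]]:
    """Return the midpoint and n_points measurement times for one vowel.

    ASSUMPTION: the points are equally spaced and strictly inside the
    vowel interval (fractions k / (n_points + 1) of its duration).
    The source abstract does not specify the spacing.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    if n_points < 1:
        raise ValueError("n_points must be at least 1")

    duration = end - start
    midpoint = start + duration / 2
    # Interior points only, so vowel edges (most affected by coarticulation
    # and segmentation error) are never sampled.
    points = [start + duration * k / (n_points + 1) for k in range(1, n_points + 1)]
    return midpoint, points


if __name__ == "__main__":
    # Hypothetical vowel segmented from 1.0 s to 2.0 s.
    mid, pts = sample_times(1.0, 2.0)
    print(mid)  # 1.5
    print(pts)  # [1.125, 1.25, 1.375, 1.5, 1.625, 1.75, 1.875]
```

With an odd number of equally spaced points, the central point coincides with the midpoint, so the two measurement sets can be cross-checked against each other. F0 and F1 to F3 would then be read at each returned time with whichever analysis tool is in use.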

