A multiparametric system for forensic speaker comparison

Authors

E. San Segundo, P. Univaso, J. Gurlekian

Keywords:

MFCCs, Voice quality, Stress test, Twins, Disguise

Abstract

In forensic speaker comparison (FSC), several different parameters are commonly analysed. In this investigation we propose a multiparametric system that combines long-term features (f0, voice quality and durational aspects) with the short-term features (MFCCs) used by a standard automatic system based on the i-vector/PLDA approach (the baseline system). The objective was to determine whether the new FSC system performs better than the baseline system. To this end, three experiments were carried out to evaluate the new multiparametric system under extreme conditions, as a kind of stress test: (1) recordings with forensically realistic characteristics (e.g. background noise, reverberation, intra-speaker variability, signal compression); (2) voice comparison of 12 monozygotic twin pairs; and (3) comparison of voices disguised through nose pinching. The results show that the new system performs better than the baseline system, although the mean contribution of the long-term features to the new system was only 6.5%, with the short-term features accounting for the remaining 93.5%.

Published

2019-06-05

How to Cite

San Segundo, E., Univaso, P., & Gurlekian, J. (2019). A multiparametric system for forensic speaker comparison. Journal of Experimental Phonetics, 28, 13–45. Retrieved from https://revistes.ub.edu/index.php/experimentalphonetics/article/view/44038

Issue

Section

Articles