Overview

In recent decades, a great deal of research has focused on computer-based processing and recognition of human speech, with varying success. Such systems have two core elements:

  • The first is a feature vector that contains the most relevant parameters of the given voice.
  • The second is an artificial intelligence (AI) system that can distinguish between different cases and make decisions in specific situations.

Besides this research, in the Laboratory of Speech Acoustics (LSA) we are also developing an infocommunication-based system that relies on eHealth and telemedicine methodologies, in order to improve the quality of speech disorder diagnosis and of speech therapy. Such a system would provide a fast and easy connection between patients and doctors, supporting both diagnosis and patient monitoring. It would also make speech therapies easy to adjust and would help teach patients correct voice production without leaving their homes.

Database Collection

One of the biggest challenges in this research is creating a sufficiently large pathological voice database for the statistical evaluation of the effectiveness and correctness of the different approaches.


To build such a database, in the LSA we continuously collaborate with medical doctors and hospitals, where patients' speech is recorded following a controlled, predefined protocol. Two types of speech are collected from each patient: a sustained vowel (repeated three times if possible) and a read text that is commonly used in phoniatric research (The North Wind and the Sun). All speech is recorded in Hungarian, the mother tongue of every patient in our database. For each patient, the database also contains gender, age, and the diagnosed pathological disorder, all validated by medical doctors. Besides the diagnosis, the so-called RBH scale (R: Rauigkeit - roughness, B: Behauchtheit - breathiness, H: Heiserkeit - hoarseness) is used to classify each patient's voice by its perceived quality. Medical doctors rate each of the three components on a 1-4 scale by ear. This semi-subjective scale provides an additional classification method, with which the doctors' auditory judgments can be characterized for the different voice disorders.
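The per-patient record described above can be sketched as a small data structure that also enforces the protocol's constraints. This is a hypothetical layout, not the actual LSA database schema: the class name `PatientRecording`, its field names and the file-name fields are illustrative assumptions. Only the stored attributes (gender, age, diagnosis, three RBH ratings on a 1-4 scale, up to three sustained-vowel takes and one read text) come from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientRecording:
    """Hypothetical record for one patient in a pathological voice database."""
    patient_id: str
    gender: str
    age: int
    diagnosis: str        # diagnosed pathological disorder, validated by a doctor
    roughness: int        # R (Rauigkeit), rated by ear
    breathiness: int      # B (Behauchtheit), rated by ear
    hoarseness: int       # H (Heiserkeit), rated by ear
    vowel_files: List[str] = field(default_factory=list)  # sustained-vowel takes
    text_file: str = ""   # read text ("The North Wind and the Sun")

    def __post_init__(self) -> None:
        # Each RBH component is rated on a 1-4 scale.
        for name in ("roughness", "breathiness", "hoarseness"):
            value = getattr(self, name)
            if not 1 <= value <= 4:
                raise ValueError(f"{name} must be in 1..4, got {value}")
        # At least one sustained vowel, repeated up to three times if possible.
        if not 1 <= len(self.vowel_files) <= 3:
            raise ValueError("expected one to three sustained-vowel recordings")

    def rbh(self) -> str:
        """Compact RBH notation, e.g. 'R2B1H2'."""
        return f"R{self.roughness}B{self.breathiness}H{self.hoarseness}"


# Hypothetical example entry.
example = PatientRecording("p001", "F", 54, "example diagnosis", 2, 1, 2,
                           ["a1.wav", "a2.wav", "a3.wav"], "north_wind.wav")
print(example.rbh())  # R2B1H2
```

Validating in `__post_init__` rejects an out-of-range rating at the point of entry, before it can distort any later statistics.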

Preprocessing and Classification

To characterize speech production problems such as vocal tract cancer, it is essential to identify features that are specific to the given disorder. In the LSA we mainly focus on linear and nonlinear acoustic-phonetic parameters and their statistics (mean, median, distribution, variance, etc.):

  • Jitter
  • Shimmer
  • HNR - Harmonics-to-Noise Ratio
  • MFCC - Mel-Frequency Cepstral Coefficients
  • SPI - Soft Phonation Index
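Jitter, shimmer and HNR have widely used standard definitions. The sketch below shows one common variant of each. Local jitter and local shimmer are the mean absolute difference between consecutive glottal periods (or cycle peak amplitudes), divided by their mean and expressed in percent. HNR is derived from the normalized autocorrelation peak r at the pitch lag as 10 * log10(r / (1 - r)), the formulation used by Praat. The sketch assumes that period lengths, peak amplitudes and r have already been extracted by a pitch tracker, which is not shown. The function names are hypothetical, and the exact variants used in the LSA may differ.

```python
import math
from statistics import mean
from typing import Sequence


def _relative_mean_abs_diff(values: Sequence[float]) -> float:
    """Mean |x[i] - x[i+1]| over consecutive values, divided by mean(x), in %."""
    if len(values) < 2:
        raise ValueError("need at least two consecutive cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return 100.0 * mean(diffs) / mean(values)


def local_jitter(periods: Sequence[float]) -> float:
    """Local jitter (%) from consecutive glottal period lengths."""
    return _relative_mean_abs_diff(periods)


def local_shimmer(amplitudes: Sequence[float]) -> float:
    """Local shimmer (%) from consecutive cycle peak amplitudes."""
    return _relative_mean_abs_diff(amplitudes)


def hnr_from_autocorrelation(r_max: float) -> float:
    """HNR in dB from the normalized autocorrelation peak r_max at the pitch lag."""
    if not 0.0 < r_max < 1.0:
        raise ValueError("r_max must lie strictly between 0 and 1")
    return 10.0 * math.log10(r_max / (1.0 - r_max))


# Hypothetical period lengths (ms) of four consecutive cycles of a sustained vowel.
print(round(local_jitter([5.0, 5.1, 4.9, 5.0]), 2))  # 2.67
print(round(hnr_from_autocorrelation(0.99), 1))      # 20.0
```

Each function returns a single value per recording. Frame-wise features such as MFCCs would instead be summarized with the statistics listed above (mean, median, variance) before they enter the feature vector.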

To select an optimal feature vector, Forward Feature Selection (FFS) is used to identify the most significant parameters describing the given disorder. After feature selection, different classification and artificial intelligence methods are applied to find the best model for a clinical decision support system that is accurate enough for medical applications. In the LSA we mainly use support-vector-based statistical models such as SVR (Support Vector Regression) and SVM (Support Vector Machine), but we also develop soft-computing-based models, such as fuzzy and neural network classifiers, for these multi-dimensional classification problems.
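The greedy idea behind Forward Feature Selection can be sketched in a few lines. Start from an empty feature set, repeatedly add the single feature that most improves a validation score, and stop when no remaining candidate improves it. In this sketch the score is the leave-one-out accuracy of a simple nearest-centroid classifier. It is swapped in for the SVM/SVR models mentioned above only to keep the example dependency-free. The function names and the toy data are hypothetical. In practice the score function would wrap a cross-validated SVM, and scikit-learn's `SequentialFeatureSelector` with `direction="forward"` offers a ready-made implementation.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Score = Callable[[List[int]], float]  # maps a feature subset to a validation score


def forward_feature_selection(n_features: int, score: Score,
                              max_features: int) -> Tuple[List[int], float]:
    """Greedily add the feature that most improves `score`; stop when no
    remaining feature improves it or `max_features` have been selected."""
    selected: List[int] = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Score every candidate added to the current subset; ties go to the lower index.
        cand_score, cand = max(((score(selected + [f]), f) for f in candidates),
                               key=lambda t: (t[0], -t[1]))
        if cand_score <= best_score:
            break  # no improvement, so stop early
        selected.append(cand)
        best_score = cand_score
    return selected, best_score


def loo_nearest_centroid_score(X: Sequence[Sequence[float]],
                               y: Sequence[int]) -> Score:
    """Build a score function: leave-one-out accuracy of a nearest-centroid
    classifier restricted to a feature subset (stand-in for an SVM)."""
    labels = sorted(set(y))

    def score(subset: List[int]) -> float:
        correct = 0
        for i in range(len(X)):
            # Class centroids computed without sample i (leave-one-out).
            centroids = {}
            for label in labels:
                rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
                centroids[label] = [mean(r[k] for r in rows) for k in subset]
            point = [X[i][k] for k in subset]
            pred = min(labels, key=lambda c: sum((p - q) ** 2
                                                 for p, q in zip(point, centroids[c])))
            correct += pred == y[i]
        return correct / len(X)

    return score


# Toy data (hypothetical): feature 0 separates the classes, features 1-2 are less informative.
X = [[0.10, 5.0, 1.0], [0.20, 1.0, 0.0], [0.15, 3.0, 1.0],
     [0.90, 4.0, 0.0], [1.00, 2.0, 1.0], [0.95, 0.0, 0.0]]
y = [0, 0, 0, 1, 1, 1]
print(forward_feature_selection(3, loo_nearest_centroid_score(X, y), max_features=2))
# ([0], 1.0)
```

Because the scoring function is passed in, the same selection loop works with any classifier. Stopping as soon as the score no longer improves keeps the feature vector small, which matters when the pathological database has relatively few recordings per disorder.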