Tawseef, Khan (2021) Speech features analysis of the joint speech separation and automatic speech recognition model / Tawseef Khan. Masters thesis, Universiti Malaya.
PDF (The Candidate's Agreement) - Restricted to Repository staff only. Download (228Kb)
PDF (Thesis M.A.) - Download (1311Kb)
Abstract
Speech recognition of a target speaker from a single-channel mixture containing voiced noise from interfering speakers is a complex task, because the speech signal patterns of the target and interfering speakers are similar and can be difficult to distinguish from one another. If the target speaker's speech can be correctly identified, such a system can be used in interviews, courtrooms, transcribing video subtitles, etc. During conversations between multiple speakers, it is common for voices to overlap; in such cases, the target speaker's speech must be separated from a single audio signal. To date, ASR models are good at recognizing lexical content in white or background noise but perform poorly in the presence of other voiced noise. Recently, a joint speech separation and ASR model was proposed that combines the tasks of speech separation and recognition into one component in an end-to-end fashion. Two key factors affecting the accuracy of ASR models are the type of features used to build the model and the signal-to-noise ratio (SNR) of the target signal. This research compares different features to find the optimum features for the joint speech separation and ASR model at different SNR levels. Ten features previously used for speech separation of voiced noise were used to test the accuracy of the model at SNR levels of -10, -5, 0, 5, and 10 dB: STFT, LOG-POW, LOG-MEL, LOG-MAG, GF, GFCC, MFCC, PNCC, RASTA-PLP (Relative Spectral - Perceptual Linear Prediction), and AMS. The experiment evaluates the Word Error Rate (WER) of speech separation and of ASR separately within the joint speech separation and ASR model. At SNR level -10 dB, GF and GFCC were found to have the lowest WER. For SNR levels -5, 0, 5, and 10 dB, the lowest WER was achieved by GF, PNCC, STFT, and GF, respectively.
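As a point of reference for the SNR levels used in the evaluation, the sketch below shows one common way to mix a clean target utterance with an interfering utterance at a prescribed SNR. The helper name mix_at_snr and the use of NumPy are illustrative assumptions, not code from the thesis.

import numpy as np

def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `interferer` into `target` so the mixture has the given SNR in dB.

    SNR(dB) = 10 * log10(P_target / P_interferer), where P is mean signal power.
    Hypothetical helper for illustration; not taken from the thesis.
    """
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    p_target = np.mean(target ** 2)
    p_interferer = np.mean(interferer ** 2)
    # Scale the interferer so that 10*log10(p_target / p_scaled) == snr_db.
    scale = np.sqrt(p_target / (p_interferer * 10 ** (snr_db / 10)))
    return target + scale * interferer

# Example: create mixtures at the five SNR levels used in the study.
# for snr in (-10, -5, 0, 5, 10):
#     mixture = mix_at_snr(clean_speech, interfering_speech, snr)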
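Several of the features listed in the abstract are standard time-frequency representations. Below is a minimal sketch of how three of them (STFT magnitude, log-mel spectrogram, and MFCCs) are commonly computed with the librosa library; the file name and parameter values are illustrative defaults, not the settings used in the thesis, and features such as GFCC, PNCC, RASTA-PLP, and AMS typically require other toolkits.

import numpy as np
import librosa

# Load a mono waveform at 16 kHz (a common rate for ASR corpora).
y, sr = librosa.load("mixture.wav", sr=16000)

# STFT magnitude spectrogram.
stft_mag = np.abs(librosa.stft(y, n_fft=512, hop_length=160))

# Log-mel spectrogram (LOG-MEL): mel filterbank energies on a log scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)

# Mel-frequency cepstral coefficients (MFCC).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)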
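WER, the metric reported in the abstract, is the word-level Levenshtein edit distance between the reference and hypothesis transcripts, normalised by the reference length: WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of reference words. A self-contained sketch (the function name wer is an assumption for illustration):

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length.

    Computed as word-level Levenshtein distance via dynamic programming.
    Illustrative implementation; not taken from the thesis.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

# Example: wer("the cat sat", "the cat sit") == 1/3 (one substitution in three words).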
Item Type: Thesis (Masters)
Additional Information: Dissertation (M.A.) – Faculty of Computer Science & Information Technology, Universiti Malaya, 2021.
Uncontrolled Keywords: Speech separation; Automatic speech recognition; Acoustic model; Signal-to-noise ratio; Word error rate
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software
Divisions: Faculty of Computer Science & Information Technology
Depositing User: Mr Mohd Safri Tahir
Date Deposited: 07 Mar 2022 08:02
Last Modified: 07 Mar 2022 08:02
URI: http://studentsrepo.um.edu.my/id/eprint/12942