Rashid , Jahangir (2021) Speaker identification through feature fusion based deep learning / Rashid Jahangir. PhD thesis, Universiti Malaya.
Abstract
Speech is a powerful medium of communication that conveys rich and useful information about a speaker, such as gender, accent, and other unique vocal characteristics. These characteristics enable machines to recognise a human voice using artificial intelligence techniques, with applications in security and surveillance, electronic voice eavesdropping, mobile banking, and mobile shopping. Speaker identification (SI) refers to the process of recognising human voices using such techniques. In the SI process, extracting discriminative and salient features from speaker utterances is essential for identifying speakers accurately. Various short-time features, such as Mel-frequency cepstral coefficients (MFCC), have been widely utilised because they capture the repetitive nature of speech signals efficiently, and several studies have demonstrated their effectiveness in identifying speakers correctly. However, the performance of these features degrades on complex speech datasets, and they consequently fail to capture speaker characteristics accurately. To address this problem, this study proposes a novel fusion of MFCC and time-domain features (MFCCT), which combines the strengths of both feature types to improve the accuracy of text-independent SI systems. Meanwhile, recent advances in deep learning have attracted the attention of researchers working on automatic SI. The extracted MFCCT features were therefore fed as input to a deep neural network (DNN) to construct the SI model. The experimental results showed that the proposed MFCCT features coupled with a DNN outperformed the existing baseline MFCC and time-domain features on the LibriSpeech dataset.
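The fusion idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation: the abstract does not specify which time-domain features are fused, so RMS energy and zero-crossing rate are used here purely as assumed examples, and random numbers stand in for real MFCCs and speech.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def time_domain_features(frames):
    """Per-frame time-domain features. Illustrative choice only (RMS energy
    and zero-crossing rate); the thesis may use a different set."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1, keepdims=True))
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0,
                  axis=1, keepdims=True)
    return np.hstack([rms, zcr])

def fuse_mfcct(mfcc, frames):
    """MFCCT: concatenate per-frame MFCCs with time-domain features."""
    return np.hstack([mfcc, time_domain_features(frames)])

# Toy example: random "speech" and placeholder 13-coefficient MFCCs.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)            # 1 s at 16 kHz
frames = frame_signal(signal)                  # (98, 400)
mfcc = rng.standard_normal((len(frames), 13))  # stand-in for real MFCCs
mfcct = fuse_mfcct(mfcc, frames)
print(mfcct.shape)                             # (98, 15)
```

The fused matrix simply widens each frame's feature vector, so it can be fed to any frame-level classifier, including the DNN described in the abstract.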
In addition, the DNN achieved better classification results than five other machine learning algorithms recently utilised in speaker recognition. Finally, this study also investigated the effectiveness of one-level (identify the speaker directly) and two-level (identify the gender first, then the speaker) classification methods for speaker identification. The experimental results showed that two-level classification yielded better results than one-level classification. The proposed features and two-level classification model are expected to apply widely to different types of complex speaker datasets, and the proposed features and classification method can be adopted in other domains involving voice recognition.
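The two-level scheme described above can be sketched as follows. This is an assumed illustration of the routing logic only: a trivial nearest-centroid classifier stands in for the DNN at each level, and the speaker labels and feature clusters are synthetic.

```python
import numpy as np

class NearestCentroid:
    """Minimal stand-in classifier (the thesis uses a DNN at each level)."""
    def fit(self, X, y):
        self.labels = np.unique(y)
        self.centroids = np.stack([X[y == c].mean(axis=0) for c in self.labels])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids[None, :, :], axis=2)
        return self.labels[np.argmin(d, axis=1)]

def two_level_predict(X, gender_clf, speaker_clf_by_gender):
    """Level 1 predicts gender; level 2 routes each sample to the
    speaker classifier trained only on that gender."""
    genders = gender_clf.predict(X)
    out = np.empty(len(X), dtype=object)
    for g in np.unique(genders):
        mask = genders == g
        out[mask] = speaker_clf_by_gender[g].predict(X[mask])
    return out

# Toy data: 2 male and 2 female speakers with separable feature clusters.
rng = np.random.default_rng(1)
centers = {"m_s1": [0, 0], "m_s2": [0, 5], "f_s1": [5, 0], "f_s2": [5, 5]}
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in centers.values()])
speakers = np.repeat(list(centers), 20)
genders = np.array([s[0] for s in speakers])   # "m" or "f"

gender_clf = NearestCentroid().fit(X, genders)
speaker_clfs = {g: NearestCentroid().fit(X[genders == g], speakers[genders == g])
                for g in ("m", "f")}
pred = two_level_predict(X, gender_clf, speaker_clfs)
print((pred == speakers).mean())               # 1.0 on this separable toy data
```

Splitting by gender first shrinks each second-level problem to roughly half the speaker population, which is one plausible reason the two-level method outperformed one-level classification in the experiments.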
| Item Type: | Thesis (PhD) |
|---|---|
| Additional Information: | Thesis (PhD) – Faculty of Computer Science & Information Technology, Universiti Malaya, 2021. |
| Uncontrolled Keywords: | Speaker identification; Feature fusion; Pattern recognition; LibriSpeech; Hierarchical classification |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software |
| Divisions: | Faculty of Computer Science & Information Technology > Dept of Information System |
| Depositing User: | Mr Mohd Safri Tahir |
| Date Deposited: | 13 Jan 2025 07:12 |
| Last Modified: | 13 Jan 2025 07:12 |
| URI: | http://studentsrepo.um.edu.my/id/eprint/14965 |