Named-entity recognition for numerical expression in Malay text-to-speech systems / Lit Wei Wern

Lit, Wei Wern (2019) Named-entity recognition for numerical expression in Malay text-to-speech systems / Lit Wei Wern. Masters thesis, Universiti Malaya.



      A text-to-speech (TTS) system converts text strings into human-like artificial speech. The natural language processing (NLP) component of a TTS system applies computational techniques to the analysis and synthesis of natural language and speech. It must convert non-word text material into words before synthesis. Some non-word text, such as numbers, is difficult to convert because numbers appear in many formats, including calendar dates, times, currency, addresses, and measurement units. Many existing TTS systems cannot accurately convert text that mixes several numerical formats, which reduces the intelligibility of the synthetic speech they generate. The objective of this research is to automatically classify the numerical formats in Malay text so that the NLP component can perform the appropriate text-to-speech conversion, thereby improving the intelligibility of the synthetic speech generated by existing TTS systems. The research is categorized as named-entity recognition (NER) for numerical expressions because it fits the definition of NER, which is the identification of certain entities in text. It proposes a context-based classification technique that classifies the format of a number based on the context in which the number appears. Classifying each number into its appropriate format assists in accurately converting it into the relevant text. The classification system developed in this research classifies six types of numeric context: date, time, phone number, currency, measurement, and percentage. These formats were selected because they are the most commonly used in online news. A Malay text-numbers corpus containing over five hundred numerical formats, together with their sentences, was built.
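The idea of classifying a number by its context can be illustrated with a minimal sketch. It assumes, purely for illustration, that "context" means the words within a fixed window on either side of a numeric token; the function name `context_features`, the window size, and the example sentence are hypothetical and are not taken from the thesis.

```python
import re

# A token counts as numeric if it contains at least one digit (an assumption).
NUMBER_RE = re.compile(r"\d")


def context_features(sentence, window=2):
    """Return (number_token, context_words) pairs for each numeric token.

    The context is the lowercased words up to `window` positions to the
    left and right of the numeric token, in sentence order.
    """
    tokens = sentence.split()
    pairs = []
    for i, tok in enumerate(tokens):
        if NUMBER_RE.search(tok):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            pairs.append((tok, [w.lower() for w in left + right]))
    return pairs


# Hypothetical Malay sentence ("The price of the item is only RM 25").
print(context_features("Harga barang itu ialah RM 25 sahaja"))
# -> [('25', ['ialah', 'rm', 'sahaja'])]
```

Context words such as a currency marker next to the number are the kind of cue a classifier could use to decide between formats; how the thesis actually encodes these features is not stated in the abstract.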
Four commonly used machine learning techniques were adopted to develop the classification system: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), and Decision Tree (DT). The system was evaluated using 10-fold cross-validation and a listening evaluation by native listeners. A confusion matrix was used to describe the performance of each classification model in more detail, and classification accuracy, precision, recall, and F-measure were calculated for each classifier. The highest mean classification accuracy, 94.37%, was achieved using the context-based model as the feature extractor and DT as the classifier. The mean accuracies for SVM, KNN, and LDA were 93.86%, 91.07%, and 90.39%, respectively. In conclusion, the proposed solution was found to be effective in classifying number formats and in improving the accuracy of text conversion. The listening test indicated that it increased the intelligibility of synthetic speech containing numerical formats generated by the existing Malay TTS system.
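The evaluation measures named above can all be derived from a confusion matrix. The following is a minimal sketch using the standard definitions of accuracy, precision, recall, and F-measure (here the balanced F1); the function name `per_class_metrics` and the two-class counts are made up for illustration and are not results from the thesis.

```python
def per_class_metrics(confusion, labels):
    """Compute overall accuracy and per-class (precision, recall, F1).

    confusion[i][j] is the number of items whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    metrics = {"accuracy": correct / total}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true other
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics


# Illustrative counts only: 10 true "date" items and 10 true "time" items.
example = per_class_metrics([[8, 2], [1, 9]], ["date", "time"])
print(example["accuracy"])  # -> 0.85
```

In a 10-fold cross-validation setting, a function like this would be applied to each fold's confusion matrix and the results averaged to obtain mean values.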

      Item Type: Thesis (Masters)
      Additional Information: Dissertation (M.A.) – Faculty of Computer Science & Information Technology, Universiti Malaya, 2019.
      Uncontrolled Keywords: Text-to-Speech (TTS); Named Entity Recognition (NER); Machine Learning (ML); Natural Language Processing (NLP)
      Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
      T Technology > T Technology (General)
      Divisions: Faculty of Computer Science & Information Technology
      Depositing User: Mr Mohd Safri Tahir
      Date Deposited: 26 Apr 2022 08:17
      Last Modified: 26 Apr 2022 08:17
