A phonetically rich and balanced lexical corpus using zipfian distribution for an under resourced language / Aminath Farshana

Aminath, Farshana (2018) A phonetically rich and balanced lexical corpus using zipfian distribution for an under resourced language / Aminath Farshana. Masters thesis, University of Malaya.

      Abstract

      In recent times, speech technology and its related applications have become a popular topic among researchers. Many speech technology applications have been developed for business, the military, transport, aerospace, PDAs, and so on. The importance of these applications has prompted researchers to improve the underlying techniques for many languages around the world. However, only a limited number of languages have benefited from speech technology applications such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems. One of the main reasons for this technological gap between languages is the lack of basic resources such as lexical and speech corpora, which are the foundation for developing this technology. Although researchers have managed to assemble these basic resources for some languages, the methods used to accumulate them are not as efficient as those used for established languages. Some of these methods also depend on the types of resources needed for developing lexical and speech corpora. This research focuses on developing a lexical corpus for an under-resourced language that lacks these basic resources, and on improving the quality of the corpus in terms of phonetic coverage and corpus size. Developing a lexical corpus involves collecting a large initial corpus and selecting suitable sentences from it. The selected set of sentences must cover all possible phonetic units of the language and ensure a uniform distribution of those units. This research proposes a novel method for developing a lexical corpus for Dhivehi, a language that lacks key resources for developing speech technology-based applications. The method uses a Zipfian distribution to select sentences from the large initial corpus.
From 109,208 sentences collected from web sources, 360 sentences were selected to form a phonetically rich and balanced lexical corpus. The performance of the developed corpus is evaluated in terms of phonetic coverage and corpus size. Phonetic coverage is measured by finding the sum of the sequence of phonemes in the corpus. The size of the corpus is evaluated using cosine similarity, which compares the frequency distribution of the phonemes in the final corpus with that of the large initial corpus. The closer the similarity between the final corpus and the large corpus, the better the phonetic coverage. High similarity between the two corpora indicates that the corpus developed with the proposed method can perform as efficiently as the large initial corpus. The similarity between the phonetic unit distribution of the selected sentences and the phoneme distribution of the large corpus was 0.988. Because the two distributions are this close, the optimized corpus can be expected to perform as efficiently as the larger corpus. The proposed method was also evaluated by comparing its results with an existing benchmark method (a greedy algorithm). The results show that the sentences selected using the proposed method cover all the phonetic units, and that the resulting corpus is 14 times smaller than the corpus developed using the benchmark method.
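
The comparison of phoneme frequency distributions described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the representation of a sentence as a list of phoneme symbols, and the toy data are all assumptions made for illustration, and the thesis may compute coverage and similarity differently.

```python
from collections import Counter
from math import sqrt


def phoneme_counts(sentences):
    """Count phoneme occurrences; each sentence is a list of phoneme symbols."""
    counts = Counter()
    for phonemes in sentences:
        counts.update(phonemes)
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two phoneme count vectors (Counter objects)."""
    keys = set(a) | set(b)
    # A Counter returns 0 for a missing key, so absent phonemes contribute nothing.
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def covers_all_units(selected_counts, full_counts):
    """True if every phoneme seen in the full corpus also occurs in the selection."""
    return set(full_counts) <= set(selected_counts)


# Hypothetical toy data: phoneme symbols are placeholders, not Dhivehi phonemes.
large_corpus = [["a", "b", "a"], ["b", "c"], ["a", "c", "a"], ["a", "b"]]
selected = [["a", "b", "a"], ["b", "c"]]

full = phoneme_counts(large_corpus)
subset = phoneme_counts(selected)
print("covers all units:", covers_all_units(subset, full))
print("cosine similarity:", round(cosine_similarity(subset, full), 3))
```

A similarity close to 1 means the selected sentences reproduce the relative phoneme frequencies of the large corpus, which is the sense in which the abstract reports a value of 0.988.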

      Item Type: Thesis (Masters)
      Additional Information: Dissertation (M.A.) – Faculty of Computer Science & Information Technology, University of Malaya, 2018.
      Uncontrolled Keywords: Lexical corpus; Text-to-Speech (TTS) system; Automatic Speech Recognition (ASR) system; Zipfian distribution; Phonetic
      Subjects: Q Science > QA Mathematics > QA76 Computer software
      T Technology > TA Engineering (General). Civil engineering (General)
      Divisions: Faculty of Computer Science & Information Technology
      Depositing User: Mr Mohd Safri Tahir
      Date Deposited: 31 Jan 2021 03:51
      Last Modified: 31 Jan 2021 03:51
      URI: http://studentsrepo.um.edu.my/id/eprint/11964
