Corpus-driven Malay language tweet normalization / Mohammad Arshi Saloot

Mohammad Arshi, Saloot (2018) Corpus-driven Malay language tweet normalization / Mohammad Arshi Saloot. PhD thesis, University of Malaya.


      Abstract

      The rapid spread of blogs, microblogs, and social network services has accelerated the use of casual written language, known as user-generated content (UGC). UGC diverges from standard writing conventions through coding strategies such as phonetic transcriptions (are → r), digit phonemes (me too → me2), misspellings (misappropriate → missapropriate), vowel drops (double → dble), and missing or incorrect punctuation marks (In that situation, I'd possibly come. → In that situation Id possibly come). These modifications stem from three primary factors: 1) limited message length (e.g. 140 characters per Tweet); 2) miniature keyboards; and 3) extensive use of UGC in unofficial and informal communications. The resulting abundance of out-of-vocabulary (OOV) words, also known as unknown words, substantially degrades the performance of standard natural language processing (NLP) systems. Research in NLP has therefore increasingly focused on the text normalization task, in which OOV words are converted into their context-appropriate standard forms. While diverse normalization approaches exist for English, the problem has been neglected in other languages, such as Malay. In this work, Malay is chosen because of its considerable usage on Twitter, where it is the fourth most used language. A rule-based approach to normalizing Malay Twitter messages is proposed, based on corpus-driven analysis. Corpus-driven analysis relies on frequency information in the form of word-frequency lists, concordances, clusters, and keywords. To design the normalization system, three analyses were performed on the Malay Twitter corpus and a standard Malay corpus: 1) frequency of unknown words; 2) abbreviation patterns; and 3) letter repetition. A Malay language Twitter corpus, known as the Malay Chat-style Corpus (MCC), is constructed. 
The MCC, which encompasses 1 million Twitter messages, consists of 14,484,384 word instances, 646,807 unique vocabulary items, and metadata such as the Twitter client application used, posting time, and type of Twitter message (simple Tweet, Retweet, Reply). To build the MCC, which represents Malay Twitter lingo, the following corpus-compiling criteria were considered: sampling, representativeness, machine readability, balance, and size of data. A portion of the MCC is manually annotated for use in the development and testing stages of the normalization system. The architecture of the Malay normalization system contains seven primary modules: (1) enhanced tokenization; (2) In-Vocabulary (IV) detection; (3) colloquial dictionary lookup; (4) repeated letter elimination; (5) abbreviation normalizer; (6) English word translation; and (7) de-tokenization. The normalization modules are formulated based on the results of the MCC analysis and implemented via rule-based state machines. The accuracy of the system is evaluated in terms of BLEU score. The result is encouraging: a BLEU score of 0.91 is achieved against a baseline BLEU score of 0.46. To compare the accuracy of the system with a probabilistic approach on an identical Malay dataset, a statistical machine translation (SMT) normalization system is implemented, trained, and evaluated. The experimental results show that the proposed architecture, which is designed based on the results of our corpus-driven analysis, achieves higher accuracy.
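
As a rough illustration of how the modules listed in the abstract might chain together, the following minimal Python sketch covers five of the seven stages: tokenization, IV detection, colloquial dictionary lookup, repeated letter elimination, and de-tokenization. It is not the thesis implementation. The lexicons, function names, and regular expressions are assumptions made for this example, the abbreviation normalizer and English word translation are omitted, and plain functions stand in for the rule-based state machines.

```python
import re

# Toy lexicons, assumed for illustration only. The thesis derives its
# resources from the MCC and a standard Malay corpus.
STANDARD_VOCAB = {"saya", "tidak", "yang", "makan", "sangat", "lapar"}
COLLOQUIAL_DICT = {"x": "tidak", "yg": "yang", "sy": "saya"}

# A word is a run of word characters; any other non-space character
# (punctuation) becomes its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into lowercase word and punctuation tokens.

    Lowercasing is a simplification made for this sketch.
    """
    return TOKEN_RE.findall(text.lower())


def squeeze_repeats(token):
    """Shorten runs of a repeated letter until an IV word is found.

    Runs are first capped at two letters, then at one. The original
    token is returned if neither candidate is in the vocabulary.
    """
    for keep in (2, 1):
        pattern = r"(.)\1{" + str(keep) + r",}"
        candidate = re.sub(pattern, lambda m: m.group(1) * keep, token)
        if candidate in STANDARD_VOCAB:
            return candidate
    return token


def normalize_token(token):
    """Apply the stages in order; stop at the first one that succeeds."""
    if token in STANDARD_VOCAB:      # IV detection: leave as is
        return token
    if token in COLLOQUIAL_DICT:     # colloquial dictionary lookup
        return COLLOQUIAL_DICT[token]
    return squeeze_repeats(token)    # repeated letter elimination


def detokenize(tokens):
    """Join tokens with spaces and reattach punctuation to the left."""
    return re.sub(r"\s+([^\w\s])", r"\1", " ".join(tokens))


def normalize(text):
    return detokenize([normalize_token(t) for t in tokenize(text)])


print(normalize("sy x lapar sangattt!"))  # saya tidak lapar sangat!
```

Checking IV status before any rewriting keeps well-formed words untouched, and accepting a squeezed candidate only when it appears in the vocabulary avoids damaging words that legitimately contain doubled letters.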

      Item Type: Thesis (PhD)
      Additional Information: Thesis (PhD) – Faculty of Computer Science & Information Technology, University of Malaya, 2018.
      Uncontrolled Keywords: Casual written language; Standard natural language processing (NLP) systems; Malay language; Social media platform
      Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
      Divisions: Faculty of Computer Science & Information Technology
      Depositing User: Mr Mohd Safri Tahir
      Date Deposited: 27 Sep 2018 03:12
      Last Modified: 27 Sep 2018 03:12
      URI: http://studentsrepo.um.edu.my/id/eprint/8982
