Development of a corpus and a parser for written Malaysian Tamil / Elanttamil Maruthai

Elanttamil, Maruthai (2022) Development of a corpus and a parser for written Malaysian Tamil / Elanttamil Maruthai. PhD thesis, Universiti Malaya.


Abstract

Tamil is a classical language with an ancient heritage and an unbroken history. Over the years, its grammar and lexicon have changed: some old linguistic features have disappeared while new ones have emerged. These changes require serious investigation, especially given current developments in corpus and computational linguistic research. It is therefore surprising that little research has taken advantage of these technological advances in studying Tamil in general and Malaysian Tamil in particular. The present thesis addresses this gap by developing the first corpus of Written Malaysian Tamil (WMTC), which allows the Tamil language in Malaysia to be analysed on the basis of authentic usage. The thesis describes the creation and development of the WMTC, a one-million-word corpus of 500 text samples of about 2,000 words each, spanning eight genres: periodicals, popular magazines, the Internet, school textbooks, fiction, academic journals, texts written to be spoken, and an unclassified category. Since Tamil is a morphologically rich language, the thesis also develops an automatic unified Tamil parser and part-of-speech (POS) tagger. It discusses the development of this software, which consists of a morphological parser, a tagger, an N-gram tool and a concordancer for the linguistic analysis of Tamil. Given the scope of the research, the thesis focuses only on the morphological analysis of written Tamil, as an illustration of how the software can be used to analyse the morphology of written Malaysian Tamil and possibly other varieties of Tamil. One innovation of the parser is the incorporation of a Tamil computational algorithm, which makes the analysis and processing of morphological features possible. Fifty-one POS tags were developed for this research project.
In addition, noun and verb inflection charts explaining the computational morphotactics of Tamil words were developed, along with lists of tokens, types and lemmas. The thesis makes two major contributions to corpus and computational linguistic research: the creation and development of a corpus, and of a parser. Together, these pave the way for future research in language technology, natural language processing, corpus development and computational linguistics. The research also has important implications for Tamil language pedagogy and language planning.
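
To make the N-gram tool, the concordancer and the token and type lists mentioned in the abstract more concrete, the following is a minimal Python sketch of what such tools compute. It is not the thesis software: the function names (tokenize, ngrams, concordance), the whitespace tokenisation and the two-sentence sample are all assumptions made for illustration, and no morphological parsing or POS tagging is attempted.

```python
from collections import Counter

# Punctuation trimmed from token edges (an illustrative choice, not the thesis's rule).
PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Split on whitespace and trim surrounding punctuation; no morphological analysis."""
    stripped = (tok.strip(PUNCT) for tok in text.split())
    return [tok for tok in stripped if tok]

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, keyword, width=1):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical toy sample: "He came to school. She came to school."
sample = "அவன் பள்ளிக்கு வந்தான். அவள் பள்ளிக்கு வந்தாள்."

tokens = tokenize(sample)                    # running words (tokens)
types = set(tokens)                          # distinct word forms (types)
bigram_counts = Counter(ngrams(tokens, 2))   # frequency of each two-word sequence
kwic = concordance(tokens, "பள்ளிக்கு")       # keyword-in-context lines

print(len(tokens), len(types))               # 6 tokens, 5 types
for left, key, right in kwic:
    print(f"{left} [{key}] {right}")
```

A lemma list would additionally require mapping each inflected form to its base form, for example grouping வந்தான் and வந்தாள் under one verb lemma. That step depends on the kind of morphological analysis the thesis describes and is outside this sketch.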

Item Type:	Thesis (PhD)
Additional Information:	Thesis (PhD) – Faculty of Languages and Linguistics, Universiti Malaya, 2022.
Uncontrolled Keywords:	Corpus; Parser; Malaysian Tamil; Classical language; Computational linguistics
Subjects:	P Language and Literature > P Philology. Linguistics
Divisions:	Faculty of Languages and Linguistics
Depositing User:	Mr Mohd Safri Tahir
Date Deposited:	08 Sep 2025 02:42
Last Modified:	08 Sep 2025 02:42
URI:	http://studentsrepo.um.edu.my/id/eprint/15728
