Hussein Soori, Hussein Khaled (2015) Using verb-noun collocation for disambiguating verb polysemy in English-Arabic statistical machine translation / Hussein Khaled Hussein Soori. PhD thesis, University of Malaya.
PDF (Full Text), 2919Kb. Restricted to Repository staff only until 01 August 2017.
Abstract
This thesis attempts to resolve the problem of verb-noun collocation in English-Arabic machine translation engines. The problem shows in the semantically ill-formed output that current machine translation systems produce when the wrong verb synonym is chosen for the Arabic translation. It begins when an engine must pick one sense from a set of senses of a polysemous English verb in order to find the equivalent verb in Arabic. This selection mostly depends on the syntactic environment and on verb semantic features that serve as selectional restrictions. These selectional restrictions can be very effective for resolving verb polysemy, but they lead to a dead end when the task is to find the verb that collocates most strongly with the noun in the Arabic output. To resolve this problem, this work uses a statistical method inspired by Church et al. (1991) in a prototype designed to retrieve verb-noun collocates in Arabic. The testing data sets for this prototype were chosen from various topics. Two multi-domain corpora in Modern Standard Arabic were chosen for this work: the Contemporary Corpus of Arabic and the Arabic Corpus by Mourad Abbas. Together, the chosen corpora contain 14 million words. The testing data sets were translated by Google, Bing and the prototype designed for this thesis. To evaluate these three engines, a simple metric was proposed that includes a gold standard value for the noun-verb collocation in the Arabic translation. According to the evaluation metric, Bing scored a verb-noun collocation value of 0.72, Google scored 0.75 and the prototype scored 0.89. The final results showed that the average performance rate is between 0.65 and 0.67 for Bing, between 0.63 and 0.85 for Google, and between 0.82 and 0.88 for the prototype.
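As an illustration of the kind of statistic involved, the following is a minimal sketch of ranking candidate verbs by their association with a noun. The abstract does not say which measure the prototype uses; pointwise mutual information over a fixed word window is assumed here because it is a measure commonly used in this line of work. The function names, the `window` parameter and the toy English tokens are all hypothetical stand-ins, not the thesis implementation or its data.

```python
import math
from collections import Counter


def pmi_scores(tokens, noun, verbs, window=3):
    """Score each candidate verb by pointwise mutual information with `noun`.

    Assumption: a verb co-occurs with the noun when it appears within
    `window` tokens on either side of an occurrence of the noun.
    """
    n = len(tokens)
    freq = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok != noun:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in verbs:
                pair[tokens[j]] += 1
    scores = {}
    for v in verbs:
        if pair[v] == 0:
            continue  # never seen near the noun, so no score
        # PMI = log2( P(noun, verb) / (P(noun) * P(verb)) )
        scores[v] = math.log2((pair[v] * n) / (freq[v] * freq[noun]))
    return scores


def best_verb(tokens, noun, verbs, window=3):
    """Return the highest-scoring candidate verb, or None if none co-occur."""
    scores = pmi_scores(tokens, noun, verbs, window)
    return max(scores, key=scores.get) if scores else None
```

In a translation setting, `verbs` would hold the Arabic candidates for one polysemous English verb, and the candidate returned by `best_verb` would be the one preferred for the given noun.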
This thesis shows that retrieving the verb that collocates most strongly with the noun in Arabic corpora is a sophisticated task, because Arabic is highly inflectional and agglutinative: particles, personal pronouns (both subject and object) and possessive pronouns are attached to the verb in Arabic texts. The task involves two aspects: choosing the search query, and choosing the distance between the noun and the verb. The choice of query for the noun and the verb is largely governed by verb conjugation and noun declension. The search query (stem or lemma) must therefore be modified according to verb features such as tense, number, mood and aspect, and noun features such as number, gender, definiteness, case and possessive clitic. Furthermore, decreasing the search distance may cause the search to miss some genuine collocations, while increasing the distance can let noise into the results.
Keywords: English-Arabic machine translation; verb-noun collocation in Arabic; statistical machine translation; collocation retrieval; polysemy and collocation; Arabic corpora
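The two aspects named in the abstract, the search query and the search distance, can be sketched as a simple co-occurrence count. This is a hedged illustration only: the function name, the `max_distance` parameter and the idea of passing sets of inflected surface forms are assumptions, and the tokens below are placeholders rather than real Arabic forms or corpus data.

```python
def count_cooccurrences(tokens, noun_forms, verb_forms, max_distance):
    """Count noun/verb pairs whose positions are at most `max_distance` apart.

    `noun_forms` and `verb_forms` are sets of surface forms to match, so
    the caller decides how far the query is expanded beyond the bare lemma.
    """
    noun_pos = [i for i, t in enumerate(tokens) if t in noun_forms]
    verb_pos = [i for i, t in enumerate(tokens) if t in verb_forms]
    return sum(
        1
        for n in noun_pos
        for v in verb_pos
        if 0 < abs(n - v) <= max_distance
    )


# Placeholder tokens: "w" is filler, the others stand for inflected forms.
tokens = ["VERB_past", "NOUN_def", "w", "w", "VERB_pres", "w", "w",
          "NOUN", "w", "w", "w", "w", "VERB_past"]
all_nouns = {"NOUN", "NOUN_def"}
all_verbs = {"VERB_past", "VERB_pres"}

for distance in (1, 3, 5):
    print(distance, count_cooccurrences(tokens, all_nouns, all_verbs, distance))

# A query restricted to single forms misses pairs the expanded query finds.
print(count_cooccurrences(tokens, {"NOUN"}, {"VERB_past"}, 3))
```

On this toy input the count grows as the distance widens, and the query restricted to single forms finds nothing at a distance of 3. Whether the extra pairs at larger distances are real collocations or noise cannot be decided by counting alone.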
Item Type: | Thesis (PhD) |
---|---|
Additional Information: | Thesis (PhD) - Faculty of Languages and Linguistics, University of Malaya, 2015 |
Uncontrolled Keywords: | English-Arabic machine translation; Verb-noun collocation in Arabic; Statistical machine translation; Collocation retrieval; Polysemy and collocation; Arabic corpora |
Subjects: | P Language and Literature > P Philology. Linguistics; P Language and Literature > PE English; P Language and Literature > PJ Semitic |
Divisions: | Faculty of Languages and Linguistics |
Depositing User: | Mrs Nur Aqilah Paing |
Date Deposited: | 20 Nov 2015 16:00 |
Last Modified: | 20 Nov 2015 16:00 |
URI: | http://studentsrepo.um.edu.my/id/eprint/6018 |