Mohammed Salem, Abdullah Kaity (2020) A semi-automatic integrated framework for non-English sentiment lexicons / Mohammed Salem Abdullah Kaity. PhD thesis, Universiti Malaya.
Abstract
Social media networks have grown significantly in the last few years, and posting opinions and messages on social networking websites has become a popular activity on the Internet. These data sources are important for business intelligence and market analytics, as human opinions are a major indicator of human desires and behaviour. This has led to the development of a new field of study called sentiment analysis, which analyses, evaluates and interprets opinions using text mining and Natural Language Processing (NLP) techniques in order to identify the polarity of a text as positive, neutral or negative. Sentiment analysis resources must be built before sentiment analysis models can be developed. Sentiment lexicons are a major resource of this kind: they list opinion words and phrases along with their sentiment orientation. A review of the literature revealed that, although texts are available in many languages, the majority of sentiment analysis studies have focused on English. As a result, non-English languages suffer from a shortage of lexicons and resources. In addition, the techniques used to build sentiment lexicons for non-English languages have several disadvantages, such as an inability to handle a particular domain or the informal expressions and vocabulary used in social media feeds. Furthermore, some non-English sentiment lexicons suffer from translation issues and cultural differences when they are translated from other languages. To overcome these issues, this work proposes a language-independent integrated framework that semi-automatically builds non-English sentiment lexicons from available English lexicons and an unannotated corpus in the target language.
The framework comprises three layers: corpus-based, lexicon-based and human-based. The first two layers automatically recognise and extract novel polarity words from a large unannotated corpus with the help of initial seed lexicons. The major advantage of the framework is that it needs only an initial seed lexicon and an unannotated corpus to start the extraction. The framework is semi-supervised owing to its use of seed lexicons. Experiments were carried out on three languages, and the lexicons produced by the proposed framework performed better than existing lexicons. The F-measure values for the Arabic, French and Malay lexicons were 0.778, 0.838 and 0.686, respectively.
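The abstract does not specify how the corpus-based layer scores candidate words, so the following is only a minimal sketch of the general idea of growing a seed lexicon from an unannotated corpus. It is not the algorithm from the thesis. The function name `expand_lexicon`, the sentence-level co-occurrence scoring, and the `min_count` and `min_margin` thresholds are all assumptions made for illustration.

```python
from collections import Counter


def expand_lexicon(corpus, seeds, min_count=2, min_margin=0.5):
    """Propose new polarity words from an unannotated, tokenised corpus.

    corpus: list of sentences, each a list of tokens (hypothetical input format).
    seeds:  dict mapping seed words to +1 (positive) or -1 (negative).
    Returns a dict of candidate words mapped to +1 or -1.
    """
    pos_hits = Counter()  # co-occurrences with positive seeds
    neg_hits = Counter()  # co-occurrences with negative seeds

    for sentence in corpus:
        tokens = set(sentence)
        n_pos = sum(1 for t in tokens if seeds.get(t) == 1)
        n_neg = sum(1 for t in tokens if seeds.get(t) == -1)
        for token in tokens:
            if token in seeds:
                continue  # seeds are already in the lexicon
            pos_hits[token] += n_pos
            neg_hits[token] += n_neg

    candidates = {}
    for word in set(pos_hits) | set(neg_hits):
        total = pos_hits[word] + neg_hits[word]
        if total < min_count:
            continue  # too little evidence (assumed threshold)
        score = (pos_hits[word] - neg_hits[word]) / total
        if abs(score) >= min_margin:
            candidates[word] = 1 if score > 0 else -1
    return candidates


# Toy usage with an English corpus, purely for readability.
corpus = [
    ["good", "tasty", "food"],
    ["great", "tasty", "meal"],
    ["bad", "bland", "food"],
    ["awful", "bland", "meal"],
]
seeds = {"good": 1, "great": 1, "bad": -1, "awful": -1}
new_words = expand_lexicon(corpus, seeds)
```

In this toy run, words that appear equally often with positive and negative seeds are left out because their score falls below the margin. In a semi-automatic setting such as the one the abstract describes, candidates like these would then go to a human reviewer rather than being added directly.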
Item Type: | Thesis (PhD) |
---|---|
Additional Information: | Thesis (PhD) – Faculty of Computer Science & Information Technology, Universiti Malaya, 2020. |
Uncontrolled Keywords: | Sentiment lexicon; Sentiment analysis; Text analysis; Natural language processing; Building resources |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software |
Divisions: | Faculty of Computer Science & Information Technology |
Depositing User: | Mr Mohd Safri Tahir |
Date Deposited: | 22 Jun 2023 06:54 |
Last Modified: | 22 Jun 2023 06:54 |
URI: | http://studentsrepo.um.edu.my/id/eprint/14485 |