Seyed Asadollah, Abdiesfandani (2016) Relevance detection and summarizing strategies identification algorithm using linguistic measures / Seyed Asadollah Abdiesfandani. PhD thesis, University of Malaya.
Abstract
Summarization is the process of selecting important information from a source text. Summarizing strategies are the core of the cognitive processes involved in summarization: they comprise a set of conscious tasks used to determine important information and extract the main idea of a source text. In this research project, we conducted a study of students' summaries. The findings show a strong relationship between students' summary writing proficiency and the summarizing strategies they used. We then developed a new algorithm to address the summarizing strategies identification problem. The algorithm simulates two tasks that human experts frequently perform to identify the strategies used to produce summary sentences: 1) sentence relevance identification; and 2) summarizing strategies identification. The sentence relevance identification module uses a statistical approach such as the vector space model (VSM) to represent sentences, and computes the similarity between source sentences and summary sentences using the cosine similarity measure. It then integrates semantic and syntactic similarity measures through a linear equation to capture meaning when comparing two sentences. The aim is to tell two sentences apart when they have the same surface form or share a similar bag-of-words (BOW) representation but differ in meaning. The module also employs a word semantic similarity measure to overcome the vocabulary mismatch problem in sentence comparison; this measure bridges the lexical gap between semantically similar contexts that are expressed in different wording. In addition, the sentence relevance identification module requires some linguistic pre-processing, including part-of-speech (POS) tagging, word stemming and stop-word removal.
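The sentence comparison described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the semantic or syntactic measures or the weights of the linear equation, so `word_order_similarity` and the weight `alpha` are hypothetical stand-ins, and plain cosine similarity over term-frequency vectors plays the role of the first component. Pre-processing (POS tagging, stemming, stop-word removal) is omitted.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_order_similarity(tokens_a, tokens_b):
    """Hypothetical syntactic stand-in: share of positions holding the same token."""
    if not tokens_a or not tokens_b:
        return 0.0
    matches = sum(1 for a, b in zip(tokens_a, tokens_b) if a == b)
    return matches / max(len(tokens_a), len(tokens_b))


def combined_similarity(tokens_a, tokens_b, alpha=0.7):
    """Linear combination of the two scores; alpha is an assumed weight."""
    return (alpha * cosine_similarity(tokens_a, tokens_b)
            + (1 - alpha) * word_order_similarity(tokens_a, tokens_b))


# Two sentences with an identical bag of words but different meaning.
source = "dog bites man".split()
summary = "man bites dog".split()

bow_score = cosine_similarity(source, summary)        # bag-of-words alone cannot separate them
mixed_score = combined_similarity(source, summary)    # the order-aware term lowers the score
```

In this toy pair the bag-of-words score is maximal because both sentences contain the same words, while the combined score is lower because only one position matches. That is the kind of case the linear combination is meant to handle.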
The summarizing strategies identification module relies on a set of heuristic rules together with statistical and linguistic methods, such as the position-based, title-based, cue-phrase and word-frequency methods, to identify the summarizing strategies employed by students. To evaluate the algorithm, we conducted two experiments. The first experiment examined the functionality of the system, that is, whether it can identify the summarizing strategies students use in summary writing. The results show that the system can identify the deletion, sentence combination, paraphrase and topic sentence selection strategies. It can also detect the copy-verbatim strategy, the strategy most commonly used by students. In addition to these strategies, the system can identify four methods used in the topic sentence selection strategy: 1) the cue method; 2) the title method; 3) the keyword method; and 4) the location method. The second experiment measured the performance of the algorithm against human judgment in identifying summarizing strategies, using precision, recall, F-measure and accuracy. The experimental results show that the proposed algorithm achieved acceptable results in comparison with human judgment: on average, 87% precision, 83% recall, an 85% F-score and 82% accuracy.
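For readers unfamiliar with the evaluation metrics named above, the following sketch shows how precision, recall, F-measure and accuracy could be computed by comparing system labels with human labels. The function name and the toy labels are illustrative assumptions and are not data from the thesis.

```python
def evaluate(gold, predicted, strategy):
    """Score one strategy label against human (gold) judgments.

    Returns (precision, recall, f_measure, accuracy); accuracy is computed
    over all labels, the other three for the given strategy only.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == strategy and p == strategy)
    fp = sum(1 for g, p in pairs if g != strategy and p == strategy)
    fn = sum(1 for g, p in pairs if g == strategy and p != strategy)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs) if pairs else 0.0
    return precision, recall, f_measure, accuracy


# Toy labels for five summary sentences (made up for illustration only).
human = ["deletion", "paraphrase", "deletion", "copy-verbatim", "deletion"]
system = ["deletion", "deletion", "deletion", "copy-verbatim", "paraphrase"]

precision, recall, f_measure, accuracy = evaluate(human, system, "deletion")
```

Averaging the per-strategy scores over all strategies is one common way to arrive at single summary figures like those reported above; the abstract does not state which averaging scheme was used.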