Cyber parental control framework for objectionable web content classification and filtering based on topic modelling using enhanced latent dirichlet allocation / Hamza H. M. Altarturi

Altarturi, Hamza H. M. (2023) Cyber parental control framework for objectionable web content classification and filtering based on topic modelling using enhanced latent dirichlet allocation / Hamza H. M. Altarturi. PhD thesis, Universiti Malaya.


Abstract

Children's cybersecurity is a growing concern, as unprecedented internet access potentially exposes them to objectionable content. Recent data highlight the severity of the problem, revealing a 97% surge in children's online exploitation and a 28% rise in reported online sexual abuse material involving minors. This problem has motivated academia and industry to develop frameworks for cyber parental control. Despite substantial advances in automated web classification that combines web mining and content classification methods, this study identifies a gap in applying advanced machine learning algorithms to objectionable web content classification. Most existing studies adopt a single classification approach, resulting in ineffective and unreliable classification of objectionable content. In terms of content, only a few studies address a wide range of objectionable topics, whereas most focus primarily on pornography. Furthermore, studies on classifying objectionable content use conventional topic models, such as Latent Dirichlet Allocation (LDA) and its variants. These models are built for generic fields and conventional documents; they ignore the structure of web content in HTML documents and perform poorly when applied to web content data. Neglecting the unique structure of web content causes otherwise interpretable topics to be missed and therefore leads to low topic quality and classification accuracy. Moreover, the lack of publicly accessible ground-truth datasets of objectionable web content has prevented a fair, coherent comparison of the various frameworks. This research aims to propose an effective and accurate framework for classifying objectionable web content. The Cyber Parental Control Framework (CPCF) employs a multistep approach and a novel web mining technique. It uses URL blacklist and whitelist methods as the first and second filter layers. A final classification layer is then applied, in which an HTML Topic Model (HTM) developed by this study analyses HTML tags to understand the structure of the webpages. The HTM is an enhancement of the LDA model. To achieve this aim, the study creates a ground-truth objectionable web content dataset. The dataset contains 8,000 labelled websites, split equally between objectionable and unobjectionable websites and comprising over 2 million pages. The study conducted four series of experiments to examine the CPCF. The results of the first experiment demonstrate the reliability of the ground-truth dataset using existing state-of-the-art classifiers. The results of the second experiment demonstrate the limitations of existing topic models when applied to web content. The third experiment evaluates the effectiveness of the HTM in discovering interpretable topics and term patterns compared to the widely used LDA model. The final experiment investigates the performance and accuracy of the CPCF using the HTM. The CPCF demonstrates effectiveness in web content classification and the ability to overcome the limitations of existing methods. Finally, a web-based functional prototype was developed to facilitate the CPCF's applicability and to offer a valuable reference for future research in this domain. The contribution of this study is a framework for classifying objectionable web content for cyber parental control, which was proposed, designed, evaluated, and simulated.
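
The layered filtering described in the abstract (blacklist first, whitelist second, content classification last) can be sketched roughly as follows. This is an illustrative sketch only and not the thesis's implementation: the names `host_of` and `classify_url`, the host-level list matching, the `fetch_html` and `content_classifier` callables, and the "blocked"/"allowed" labels are all assumptions. The HTM itself is not reproduced here; it is represented by a placeholder callable.

```python
from typing import Callable, Set
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Return the lowercased host name of a URL (empty string if absent)."""
    return (urlparse(url).hostname or "").lower()


def classify_url(
    url: str,
    blacklist: Set[str],
    whitelist: Set[str],
    fetch_html: Callable[[str], str],
    content_classifier: Callable[[str], bool],
) -> str:
    """Hypothetical three-layer decision: blacklist, whitelist, then content.

    `content_classifier` stands in for a topic-model-based classifier and is
    assumed to return True when the page is judged objectionable.
    """
    host = host_of(url)

    # Layer 1: known objectionable hosts are blocked without fetching.
    if host in blacklist:
        return "blocked"

    # Layer 2: known safe hosts are allowed without fetching.
    if host in whitelist:
        return "allowed"

    # Layer 3: unknown hosts are fetched and classified by content.
    html = fetch_html(url)
    return "blocked" if content_classifier(html) else "allowed"
```

In this arrangement the two list lookups settle known hosts cheaply, so the comparatively expensive content classification only runs on pages whose host appears in neither list.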

Item Type:	Thesis (PhD)
Additional Information:	Thesis (PhD) – Faculty of Computer Science & Information Technology, Universiti Malaya, 2023.
Uncontrolled Keywords:	Machine learning; Natural language processing; Latent Dirichlet allocation; Topic modeling; Language model; Cyber security; Web classification
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions:	Faculty of Computer Science & Information Technology
Depositing User:	Mr Mohd Safri Tahir
Date Deposited:	23 Sep 2024 06:59
Last Modified:	23 Sep 2024 06:59
URI:	http://studentsrepo.um.edu.my/id/eprint/15291
