Shayegan, Mohammad Amin (2015) Dataset size and dimensionality reduction approaches for handwritten Farsi digits and characters recognition. PhD thesis, University of Malaya.
Abstract
In any pattern recognition system, increasing recognition speed and improving recognition accuracy are two important goals. However, these goals usually work against each other: when one is improved, the other tends to degrade. This thesis addresses both, decreasing the overall processing time while increasing system accuracy. To that end, the number of training samples is decreased by a proposed dataset size reduction technique, which shortens the training/testing time. The number of features is also decreased by a new dimensionality reduction technique, which shortens training and testing time and, by deleting less important features, also increases system accuracy. Existing dataset size reduction algorithms usually remove either samples near the class centers or support vector samples lying between different classes. However, the former carry valuable information about class characteristics and are important for building the system model, while the latter are important for evaluating system efficiency and adjusting system parameters. The proposed dataset size reduction method employs a Modified Frequency Diagram technique to create a template for each class. A similarity value is then calculated for each pattern, and the samples in each class are rearranged according to their similarity values. The number of training samples is then reduced by a Sieving technique, which decreases the training/testing time. In the other part of this study, the number of extracted features is decreased by a new method that analyzes the one-dimensional and two-dimensional spectrum diagrams of the standard deviation and minimum-to-maximum distributions of the initial feature vector elements.
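The dataset size reduction steps described above (a class template, a similarity value per pattern, reordering, then sieving) might be sketched as follows. The abstract does not define the Modified Frequency Diagram, the similarity measure, or the Sieving rule, so everything here is an assumption made for illustration: the template is taken to be the per-pixel frequency of set pixels in binary images, the similarity is the mean agreement with that template, and sieving keeps every `keep_every`-th sample of the ranked list. The function names and the `keep_every` parameter are hypothetical.

```python
def frequency_template(samples):
    """Per-pixel fraction of samples in which that pixel is set.

    Assumption: each sample is a flat list of 0/1 pixels of equal length,
    and all samples belong to the same class.
    """
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def similarity(sample, template):
    """Assumed similarity: mean agreement between a sample and the template.

    A set pixel scores the template frequency t; an unset pixel scores 1 - t.
    """
    return sum(t if p else 1.0 - t for p, t in zip(sample, template)) / len(template)


def sieve(samples, keep_every=2):
    """Rank one class by similarity to its template, then keep every k-th sample.

    Assumption: systematic sampling over the ranked list stands in for the
    thesis's Sieving technique, which the abstract does not specify.
    """
    template = frequency_template(samples)
    ranked = sorted(samples, key=lambda s: similarity(s, template), reverse=True)
    return ranked[::keep_every]


# Four toy 2x2 "images" of one class, flattened row by row.
class_samples = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
reduced = sieve(class_samples, keep_every=2)  # keeps 2 of the 4 samples
```

Sampling at a fixed stride across the ranked list, instead of cutting off one end, retains samples from both the high-similarity and low-similarity ranges, which is in the spirit of the abstract's point that samples near class centers and samples near class boundaries are both worth keeping.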
In recent years, the appeal of Optical Character Recognition (OCR) has led researchers to develop various algorithms for recognizing different alphabets. The target performance for an OCR system is to recognize at least five characters per second with 99.9% accuracy. However, available handwritten Farsi OCR systems still fall short in both accuracy and speed. The techniques proposed in this thesis have been validated in the handwritten OCR domain using two large standard benchmark datasets: Hoda for Farsi digits and letters, and MNIST for Latin digits. The proposed dataset size reduction technique decreased the training time to less than half, while the accuracy decreased by only 0.68%. Both datasets (Hoda and MNIST) were also used for dimensionality reduction. Here, the dimension of the feature vector was reduced to 59.40% for the MNIST dataset, 43.61% for the digits part of the Hoda dataset, and 69.92% for the characters part of the Hoda dataset, while the accuracies improved by 2.95%, 4.71%, and 1.92%, respectively. These results show the superiority of the proposed method over rival dimensionality reduction methods. The proposed size reduction technique can be used for other pictorial datasets, and the proposed dimensionality reduction technique can be employed in any other pattern recognition system with numerical feature vectors.
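As a rough illustration of how standard deviation and minimum-to-maximum spread can guide feature removal for numerical feature vectors, the sketch below computes both statistics per feature and drops features that barely vary. The thesis analyzes spectrum diagrams of these distributions; the simple thresholding rule used here, the threshold values, and the function names are assumptions made for this example, not the author's procedure.

```python
from statistics import pstdev


def feature_stats(vectors):
    """Return (population standard deviation, max - min) for each feature column."""
    return [(pstdev(column), max(column) - min(column)) for column in zip(*vectors)]


def select_features(vectors, min_std=0.1, min_range=0.5):
    """Indices of features whose spread passes both thresholds.

    Assumption: a feature that varies little across the training set is
    treated as less important. The thresholds are hypothetical.
    """
    stats = feature_stats(vectors)
    return [i for i, (sd, rng) in enumerate(stats) if sd >= min_std and rng >= min_range]


def project(vectors, keep):
    """Keep only the selected feature indices in every vector."""
    return [[vector[i] for i in keep] for vector in vectors]


# Three toy feature vectors: feature 0 varies, feature 1 is constant,
# feature 2 varies only slightly.
training = [
    [0.0, 5.0, 1.0],
    [1.0, 5.0, 1.1],
    [2.0, 5.0, 0.9],
]
keep = select_features(training)
reduced = project(training, keep)
```

In practice the thresholds would be chosen by inspecting the distributions of both statistics over the whole feature vector, and the same `keep` indices would have to be applied to the test vectors as well.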