Automated scanned receipt processing with optical character recognition and machine learning / Hor Zhang Neng

Hor, Zhang Neng (2022) Automated scanned receipt processing with optical character recognition and machine learning / Hor Zhang Neng. Masters thesis, Universiti Malaya.



      Text detection and recognition for receipts are less studied than other popular optical character recognition (OCR) tasks, and research on post-OCR parsing of receipts is scarce. This gap opens up the opportunity to explore extracting key information from receipts and classifying them. This dissertation explores how OCR and machine learning (ML) techniques can optimize and automate receipt handling for reimbursement purposes. Automating the reimbursement process keeps faulty expense reporting to a minimum and speeds up employee claims. The dataset prepared for this work consists of one hundred receipts of the kinds commonly found in Malaysian employee expense reimbursement reports. The receipts are organized into six categories: meals, groceries, petrol, accommodation, telecommunication, and transportation fares. The receipts are of Malaysian origin and are restricted to those containing only English text. This work neither parses handwriting on receipts nor addresses text ambiguity. The text processing accuracy depends on the accuracy of the OCR tool selected. This dissertation has three objectives: developing an image processing framework to improve receipt quality before parsing, recognizing text and extracting key information from receipts using OCR, and evaluating ML classifiers for receipt classification after parsing. Overall text extraction is 90.72% accurate at the character level and 78.51% accurate at the word level, with F1 scores (the harmonic mean of precision and recall) of 0.89 and 0.78, respectively. Overall accuracy for key information extraction is 74.33%, with an F1 score of 0.74. Seven ML classifiers were compared: Naive Bayes, maximum entropy, Support Vector Machine (SVM), linear Support Vector Classifier (SVC), k-nearest neighbours (KNN), decision tree, and random forest. Their overall accuracy ranges from 52% to 80%, with F1 scores between 0.55 and 0.79. The linear SVC achieves the highest accuracy and F1 score, owing to its ability to find the hyperplane that best separates high-dimensional text data into classes.
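
      The F1 score reported in the abstract is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not the thesis's own evaluation code. The function name f1_score and the counts passed to it are hypothetical and chosen only for illustration.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # items the system reported
    actual = true_positives + false_negatives     # items that were really there
    if predicted == 0 or actual == 0:
        return 0.0  # precision or recall is undefined; treat F1 as 0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not taken from the thesis).
print(round(f1_score(8, 2, 2), 2))  # precision 0.8, recall 0.8 -> prints 0.8
```

      Because the harmonic mean is pulled towards the smaller of the two values, a system with high precision but low recall (or the reverse) still receives a low F1 score.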

      Item Type: Thesis (Masters)
      Additional Information: Dissertation (M.A.) – Faculty of Computer Science & Information Technology, Universiti Malaya, 2022.
      Uncontrolled Keywords: Understanding receipts; OCR parsing; Machine learning classification; Reimbursement process
      Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
      Divisions: Faculty of Computer Science & Information Technology
      Depositing User: Mr Mohd Safri Tahir
      Date Deposited: 28 May 2023 03:18
      Last Modified: 28 May 2023 03:18
