Public Scene Text Detection and Recognition



Low, Boon Keong (2018) Public Scene Text Detection and Recognition. Final Year Project (Bachelor), Tunku Abdul Rahman University College.



Scene text detection and recognition have various real-world applications, especially in navigation, data retrieval, assisting visually impaired people, package labelling in industrial automation, logistics and others. Given these wide application areas, this research focuses on detecting and recognizing horizontal and multi-oriented text with the aim of further improving accuracy. Various approaches have been proposed and developed by researchers and the computer vision community to improve accuracy so that the whole system can work robustly and reliably in real-time environments. For instance, in the conventional text detection approach, a sliding window slides over the image to obtain text regions, while connected-component-based approaches perform pixel-level detection. Deep learning approaches have since outperformed these methods in scene text detection and recognition.

This project proposes applying object detection concepts and frameworks to perform text detection on natural scene images. A Faster Region-based Convolutional Neural Network (Faster-RCNN) model is proposed for text detection, while a convolutional recurrent neural network (CRNN) model is proposed for text recognition. In the Faster-RCNN model, a slight change is made to the region proposal network (RPN): anchors with smaller scales and ratios are proposed, so that small-letter text in natural scene images can be detected and localized. The rest of the detection framework mainly follows the Faster-RCNN model proposed by Ren et al. (2017). For text recognition, the CRNN model applies convolution to the cropped text regions output by Faster-RCNN to obtain a feature sequence. This feature sequence is then fed into recurrent layers built from bidirectional long short-term memory (BLSTM) units, which produce a per-frame text prediction sequence.
Transcription is then performed by connectionist temporal classification (CTC), which classifies the prediction sequence into text characters or a blank class. The output of CTC is a text label sequence consisting of English characters. Datasets such as ICDAR 2013 Focused Scene Text (ICDAR 2013) and Google Street View Text (Google SVT) are used for system training and testing. With the proposed approach, Faster-RCNN achieves a relatively low text detection accuracy of 27.7% on the ICDAR 2013 dataset and 19.9% on the Google SVT dataset, while CRNN achieves a high recognition accuracy of 86.67% on ICDAR 2013 and 80% on Google SVT.
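The RPN modification described in the abstract — anchors with smaller scales and ratios to catch small-letter text — can be illustrated with a minimal NumPy sketch. The scale values (4, 8, 16) below are assumptions chosen to be smaller than the Faster-RCNN defaults of (8, 16, 32); the abstract does not state the exact values used in the project.

```python
import numpy as np

def generate_anchors(base_size=16, scales=(4, 8, 16), ratios=(0.5, 1.0, 2.0)):
    """Generate RPN anchor boxes (x1, y1, x2, y2) centred on one feature-map cell.

    Scales of (4, 8, 16) instead of Faster-RCNN's defaults of (8, 16, 32)
    are an illustrative choice: smaller anchors are meant to localize
    small-letter text in natural scene images.
    """
    cx = cy = (base_size - 1) / 2.0  # centre of the base cell
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # keep area = (base_size * scale)^2 while varying aspect ratio h/w
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([cx - (w - 1) / 2, cy - (h - 1) / 2,
                            cx + (w - 1) / 2, cy + (h - 1) / 2])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (9, 4): 3 scales x 3 ratios, one box per combination
```

At inference time, this anchor set is tiled over every cell of the convolutional feature map, and the RPN scores and regresses each anchor into a text-region proposal.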
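The CTC transcription step can likewise be sketched with greedy decoding: take the most likely class at each timestep of the BLSTM output, merge consecutive repeats, and drop blanks. This is a simplified stand-in, assuming a lowercase-only character set with the blank at index 0; the project's actual alphabet and decoding strategy are not specified in the abstract.

```python
import numpy as np

# Hypothetical character set: '-' stands for the CTC blank (index 0),
# followed by the lowercase English letters.
CHARS = "-abcdefghijklmnopqrstuvwxyz"
BLANK = 0

def ctc_greedy_decode(logits):
    """Collapse a per-timestep prediction sequence (T x num_classes) into a
    text label: argmax each step, merge consecutive repeats, drop blanks."""
    best_path = np.argmax(logits, axis=1)
    decoded, prev = [], None
    for idx in best_path:
        if idx != BLANK and idx != prev:
            decoded.append(CHARS[idx])
        prev = idx
    return "".join(decoded)

# Toy example: four timesteps predicting "t", "t", blank, "o".
T, C = 4, len(CHARS)
logits = np.zeros((T, C))
t_idx, o_idx = CHARS.index("t"), CHARS.index("o")
logits[0, t_idx] = logits[1, t_idx] = 1.0
logits[2, BLANK] = 1.0
logits[3, o_idx] = 1.0
print(ctc_greedy_decode(logits))  # prints "to"
```

The repeated "t" collapses to one character and the blank is discarded, which is exactly why CTC needs the blank class: it lets the network emit genuine double letters (e.g. "oo") by separating them with a blank.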

Item Type: Final Year Project
Subjects: Technology > Electrical engineering. Electronics engineering
Faculties: Faculty of Engineering and Technology > Bachelor of Engineering (Honours) Electrical and Electronics
Depositing User: Library Staff
Date Deposited: 10 Oct 2018 07:30
Last Modified: 10 Oct 2018 07:30