Optical Character Recognition System in Natural Scenes — Part 3

Huaqi Nie
DPU-2040 Final Project
Apr 19, 2021


Work conducted by: Haoda Song, Siyuan Li, Enmin Zhou, Yangyin Ke, Huaqi Nie

Introduction

Optical Character Recognition (OCR) has long been a challenging computer vision and deep learning problem. An OCR pipeline consists of three major steps: text detection, text recognition, and post-processing.

Text recognition in natural scenes is difficult because of frequent blur, distortion, and clutter. In this project, we construct a model that takes advantage of convolutional neural networks (CNNs), recurrent neural networks (RNNs), and the Single Shot MultiBox Detector (SSD) architecture to extract characters.

In our previous posts, blog1 and blog2, we discussed and built the models. In this post, we finalize our work and draw our conclusions.

Data Overview

We use the IIIT5K dataset to train the text recognition model, and we fine-tune a pre-trained text detection model on the Street View Text dataset. IIIT5K contains 5,000 cropped word images from scene texts and born-digital images. The pre-trained SSD text detection model is trained on the ICDAR competition dataset, and we then fine-tune it on the Street View Text dataset (SVT). SVT contains approximately 400 ground-truth scene text bounding boxes from 100 natural street-view images harvested from Google Maps. To simplify the YOLO loss computation, we compute the upper-left and lower-right corners of each bounding box from the original height and width, and then convert them into the YOLO label matrix.
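To make the label construction concrete, here is a minimal sketch of the corner-to-YOLO conversion described above. The grid size and array layout are illustrative assumptions, not our exact training code.

```python
import numpy as np

def corners_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h, grid=7):
    """Turn corner coordinates into a YOLO-style target matrix:
    (cx, cy) offsets inside the responsible grid cell, (w, h) relative to the image.
    The 7x7 grid is an assumed hyperparameter."""
    cx = (x_min + x_max) / 2.0 / img_w            # box centre, normalised to [0, 1]
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w                   # box size, normalised to [0, 1]
    h = (y_max - y_min) / img_h
    col = min(int(cx * grid), grid - 1)           # grid cell that owns this box
    row = min(int(cy * grid), grid - 1)
    target = np.zeros((grid, grid, 5), dtype=np.float32)
    target[row, col] = [cx * grid - col, cy * grid - row, w, h, 1.0]  # last slot: objectness
    return target
```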

From this preprocessed set, we split the data into training and validation sets: approximately 300 bounding boxes (75 images) in the training set and 80 bounding boxes (25 images) in the validation set.

Final Model Pipelines

For text detection, SSD outperformed traditional models such as Faster R-CNN and YOLO. It uses a VGG-16 model pre-trained on ImageNet as its base and adds multiple convolutional layers of decreasing size on top. VGG-16 extracts useful image features, and object detection then happens at each added convolutional layer. Because the feature maps shrink progressively, objects of various sizes, both large and small, can be captured well.
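As a rough illustration of this multi-scale idea, the Keras sketch below stacks extra convolutional blocks of decreasing size on a pre-trained VGG-16 and attaches a small prediction head to each scale. Layer counts, filter sizes, and the anchor count are illustrative assumptions, not our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# ImageNet-pretrained VGG-16 as the feature extractor (classifier head removed).
base = tf.keras.applications.VGG16(include_top=False, input_shape=(300, 300, 3))
x = base.output                                  # roughly 9x9 feature map at this input size

feature_maps = [x]
for filters in (512, 256, 256):                  # extra conv blocks of decreasing spatial size
    x = layers.Conv2D(filters, 1, activation='relu')(x)
    x = layers.Conv2D(filters, 3, strides=2, padding='same', activation='relu')(x)
    feature_maps.append(x)                       # each map handles a different object scale

# One small prediction head per scale: 4 box offsets + 1 text/non-text score
# per anchor (n_anchors is an assumed hyperparameter).
n_anchors = 4
preds = [layers.Conv2D(n_anchors * (4 + 1), 3, padding='same')(f) for f in feature_maps]
model = tf.keras.Model(base.input, preds)
```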

For text recognition, the Convolutional Recurrent Neural Network (CRNN) is a neural network architecture that integrates the advantages of both convolutional neural networks (CNNs) and recurrent neural networks (RNNs). A CRNN can take input images of different widths and return predictions of different lengths, so we can train directly on coarse word-level labels without detailed character-level annotations. CRNN also does not use the fully connected layers widely applied in CNNs, which makes it a much more compact and efficient model.
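A simplified CRNN sketch in Keras is shown below: convolutions extract a feature map, the height dimension is collapsed, and a bidirectional LSTM emits one character distribution per horizontal step, to be trained with a CTC loss. The character set size and layer sizes here are assumptions; the full model has more convolutional layers.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 37  # 26 letters + 10 digits + 1 CTC blank (assumed character set)

inputs = layers.Input(shape=(32, None, 1))             # fixed height, variable width
x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D((2, 2))(x)                     # height 32 -> 16
x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)                     # height 16 -> 8
x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D((8, 1))(x)                     # collapse height to 1, keep width
x = layers.Lambda(lambda t: tf.squeeze(t, axis=1))(x)  # (batch, time_steps, features)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)  # per-step character scores
model = tf.keras.Model(inputs, outputs)                # trained with a CTC loss
```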

We outline our OCR pipeline architecture in Figure 1.

Figure 1 Pipeline Architecture

Here are the architectures of our two models:

Figure 2 SSD Model Architecture
Figure 3 CRNN Model Architecture

Results

Text detection model (SSD MobileNet V2):

Compared with the pre-trained model, our fine-tuned model makes clear improvements. After fine-tuning on our street-view dataset, we obtain optimized weights that better distinguish text from non-text bounding boxes. The fine-tuned model is also more sensitive to large text than the pre-trained model, although there is still room to improve on that front. For blurred and tiny text, our model performs well, predicting clean bounding boxes even for distant text.

Figure 4 Bounding boxes predicted by the pre-trained model vs. the fine-tuned model

Text Recognition model (CRNN):

For our final CRNN text recognition model, we trained on 4,000 images from the IIIT5K dataset and selected 500 images for the validation set. To evaluate the text recognition model more thoroughly, we calculated accuracy scores at two levels: character level and word level. We summarize the improvement from the baseline model to the fine-tuned model in the table above. Given the limited amount of text image data, the model reaches a satisfying recognition accuracy.
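As an illustration, character-level and word-level accuracy can be computed along the following lines; our actual evaluation may differ in details such as how predictions of mismatched length are aligned.

```python
def char_accuracy(pred, truth):
    """Fraction of characters that match position-by-position."""
    matches = sum(p == t for p, t in zip(pred, truth))
    return matches / max(len(truth), 1)

def evaluate(pairs):
    """pairs: list of (predicted_word, ground_truth_word) tuples."""
    char_acc = sum(char_accuracy(p, t) for p, t in pairs) / len(pairs)
    word_acc = sum(p == t for p, t in pairs) / len(pairs)   # exact-match words
    return char_acc, word_acc

# e.g. evaluate([("h0use", "house"), ("street", "street")]) -> (0.9, 0.5)
```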

Overall Result

Combining the text detection model and the text recognition model, we are able to read the text in images. The complete pipeline performs well in most cases, producing accurate bounding boxes and recognized text. However, if the bounding boxes were tighter and avoided including letters from neighboring words, the recognition model would receive cleaner input and recognize the corresponding words more reliably.
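Conceptually, the combined pipeline is just detection followed by per-crop recognition, as in the sketch below. Here detector and recognizer are placeholders standing in for our SSD and CRNN models, and the image is assumed to be a NumPy array with integer box coordinates.

```python
def read_text(image, detector, recognizer):
    """Run the full OCR pipeline: detect word boxes, then recognise each crop."""
    results = []
    for (x_min, y_min, x_max, y_max) in detector(image):   # SSD-style text detector
        crop = image[y_min:y_max, x_min:x_max]              # cut out one word region
        word = recognizer(crop)                             # CRNN-style recogniser
        results.append(((x_min, y_min, x_max, y_max), word))
    return results
```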

Application

The OCR application runs on the Android platform with SDK 28 and NDK bundle 18r. It depends on the external packages org.tensorflow:tensorflow-android and org.tensorflow:tensorflow-lite. For the model, we put the trained frozen graph into the app's assets as a protobuf file. For input, we can choose an image from either the gallery or the camera. Assets are read as a stream, and the application decodes the stream into a bitmap for model input. The model outputs, bounding boxes and word predictions, are drawn onto the image. Finally, the synthesized image is bound to the view of the main activity on the screen.
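For reference, a frozen protobuf like the one shipped in the assets folder can be produced with the TensorFlow 1.x freezing utilities, roughly as follows. The output node names here are assumptions that depend on how the graph was exported.

```python
import tensorflow.compat.v1 as tf

def freeze(sess, output_node_names, path="assets/ocr_model.pb"):
    """Bake the session's variables into constants and write a frozen GraphDef."""
    graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names)
    with tf.gfile.GFile(path, "wb") as f:
        f.write(graph_def.SerializeToString())

# e.g. freeze(sess, ["detection_boxes", "detection_scores"])  # assumed node names
```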

Future Discussion

We conclude that our model detects and recognizes natural scene text from street views accurately, especially small text that is far away from the photographer. The complete pipeline also performs well on artistic text. However, in balancing the tradeoff between small and large text sizes in natural images, our model loses some ability to detect and recognize extremely large text. Beyond that weakness, the model is also unable to recognize special characters.

One potential improvement is training the model on a larger dataset so that it captures more styles and patterns of text. Post-OCR correction, such as correcting the output with BERT, would also enhance the accuracy of the predictions, although this method may require a large lexicon.
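A full BERT-based corrector is beyond a short sketch, but even a simple lexicon lookup illustrates the idea of post-OCR correction; the lexicon here is a toy example.

```python
import difflib

LEXICON = ["street", "house", "coffee", "market"]   # toy lexicon; a real one is far larger

def correct(word, cutoff=0.7):
    """Snap an OCR output to the closest lexicon entry, if one is similar enough."""
    matches = difflib.get_close_matches(word.lower(), LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("c0ffee"))   # -> "coffee"
```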


