Optical Recognition System in Natural Scenes

Huaqi Nie
DPU-2040 Final Project
Apr 5, 2021

Work conducted by: Haoda Song, Siyuan Li, Enmin Zhou, Yangyin Ke, Huaqi Nie

Introduction

“OCR technology deals with the problem of recognizing all kinds of different characters. Both handwritten and printed characters can be recognized and converted into a machine-readable, digital data format” (Anyline).

In the following exploratory journey, we will build an Optical Character Recognition (OCR) system to recognize words in images of natural scenes. To better understand and improve the overall performance of the final model, the process can be separated into two parts:

· Text detection with the COCO 2014 dataset

· Text recognition with the Google IIIT 5K word-only images

In this first post of the series, we introduce our exploratory data analysis of the two datasets, the two baseline models for the text detection and recognition stages, and a discussion of future steps and improvements.

Exploratory Data Analysis

COCO 2014 Dataset

We use the 2014 Train image set of the COCO dataset (Common Objects in Context) to train a text detection model in the first step. There are 82,784 images in total, containing roughly 150K annotated text instances, which can be divided into four groups based on their legibility (legible or illegible) and printing method (machine-printed or handwritten).

Label distribution histogram

Our training dataset is generated from the 13,880 images that contain legible text. Focusing on images with legible text lets us build a more accurate text detection model efficiently.

Google IIIT 5k Word Recognition

The IIIT 5K Word Recognition dataset is used to build the text recognition model. It contains 2,000 training images and 3,000 test images, each with a corresponding word-level label. The images are already cropped to their bounding boxes, so we can use them directly as inputs to the text recognition models.

Sample images generated from the dataset.

Since the images contain only text, and some are too dim to be recognized, resizing and denoising will be required in the following steps.
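
For reference, here is a minimal sketch of loading the word-level labels. It assumes the annotations ship as MATLAB .mat files with ImgName and GroundTruth fields, as in the public IIIT 5K release; the file name and field names should be checked against your copy.

```python
import scipy.io as sio

# Assumption: traindata.mat holds a struct array 'traindata' whose entries
# carry the image path ('ImgName') and the word label ('GroundTruth').
data = sio.loadmat("traindata.mat")["traindata"][0]
samples = [(str(d["ImgName"][0]), str(d["GroundTruth"][0])) for d in data]
print(len(samples), samples[0])  # 2000 for the train split, e.g. ('train/...png', 'WORD')
```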

Baseline Model

Text Detection Model

We extract the image information from a .json annotation file associated with the 2014 Train image set. This file describes detailed characteristics of each image, including image id, text legibility, bounding box, displayed text, etc. The COCO-Text API is used to connect the image set with the .json file and extract image details by image id. The extracted details are then loaded into our training dataset, ready to be used in model training.

Left: sample original image; Right: sample image with annotated bounding box and displayed texts
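
As a sketch of this extraction step, the snippet below uses the COCO-Text API (the andreasveit/coco-text repository cited in the references; it assumes the coco_text.py module is on the Python path, and the annotation file name is a placeholder):

```python
import coco_text

# Load the COCO-Text annotations (placeholder path).
ct = coco_text.COCO_Text("COCO_Text.json")

# Training images containing at least one legible text instance.
img_ids = ct.getImgIds(imgIds=ct.train, catIds=[("legibility", "legible")])

# Bounding boxes and transcriptions for the first such image.
anns = ct.loadAnns(ct.getAnnIds(imgIds=img_ids[0]))
for ann in anns:
    print(ann["bbox"], ann.get("utf8_string", ""))
```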

Model Architecture: We used a Faster R-CNN model with a ResNet-50 backbone, fine-tuned on our particular classes, as the baseline model for text detection. In Faster R-CNN, a separate region proposal network predicts the region proposals, so we do not have to generate proposals for the convolutional neural network with an external algorithm every time. The predicted region proposals are reshaped by an RoI pooling layer, which is then used to classify the content of each proposed region and to predict offset values for its bounding box. This makes Faster R-CNN much faster than earlier R-CNN variants for real-time object detection.

Faster R-CNN architecture.
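
A minimal sketch of assembling this baseline with torchvision, assuming the standard pretrained fasterrcnn_resnet50_fpn model with its box predictor swapped for our two classes:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Faster R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Replace the box predictor head: 2 classes (background + text).
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```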

Evaluation: In the training process the number of classes is two, since we only care about the background and the target text class, and the SGD optimizer was set with a learning rate of 0.0005, momentum of 0.9, and weight decay of 0.0005. The training loss decreases as the epochs increase and stays roughly constant, around 0.65, after 25 epochs. Although many other models provide better performance, Faster R-CNN is a reasonable baseline given its training time and the behavior of its training loss.

Training Loss plot
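
The optimizer settings above correspond to something like the following sketch, where model is the fine-tuned Faster R-CNN from the earlier snippet:

```python
import torch

# SGD over the trainable parameters, matching the reported hyperparameters.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.0005, momentum=0.9, weight_decay=0.0005)
```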

Text Recognition Model

Once we obtain the bounding boxes, the problem turns into a different task: text recognition. The key idea is to predict the text in the image region cropped by each box. For our baseline, we implemented the pre-trained OpenCV EAST model together with Tesseract v4. Tesseract is an open-source text recognition (OCR) engine that can be used directly to extract printed text from images, and it has the added benefit of supporting a wide variety of languages.
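
A minimal sketch of this recognition step with pytesseract (the crop path is a placeholder; --oem 1 selects the Tesseract v4 LSTM engine and --psm 7 treats the crop as a single text line):

```python
import cv2
import pytesseract

# Read a word crop produced by the detection stage (placeholder path).
crop = cv2.imread("word_crop.png")
crop = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)  # pytesseract expects RGB order

# Tesseract v4 LSTM engine, single-text-line mode.
text = pytesseract.image_to_string(crop, config="--oem 1 --psm 7")
print(text.strip())
```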

In this exploration, we used the 2,000 training images from the Google IIIT 5K Word Recognition dataset. Several sample text recognition results are presented below.

Sample outputs of the baseline model.

From the plots above, we see both good and bad prediction results. To refine the text recognition model, we considered the current situation from two directions. First, the bounding boxes are not satisfactory: some parts of the text are not detected, for instance “insurance” in the first image. Second, because the input images come in different sizes, a large amount of noise is created when they are resized. Accordingly, we applied image denoising to the inputs with OpenCV's cv2.fastNlMeansDenoisingColored() function. The process is shown in the figure below.

The first image is the original, the second is the denoised, the third is resized after denoising, and the last is the blob image, which is for EAST model input.
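
A sketch of that preprocessing chain follows; the denoising parameters and the 320×320 target size are illustrative, EAST requires input dimensions divisible by 32, and the mean values are the ones commonly used with the OpenCV EAST model:

```python
import cv2

img = cv2.imread("sample.jpg")  # placeholder input image

# Non-local means denoising for color images.
denoised = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)

# Resize to a fixed size whose width and height are multiples of 32.
resized = cv2.resize(denoised, (320, 320))

# Build the blob fed to the EAST network (mean subtraction + channel swap).
blob = cv2.dnn.blobFromImage(resized, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
```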

In the next step of the research, we hope to continue improving prediction performance with new methodologies. One promising direction is the Convolutional Recurrent Neural Network (CRNN), a combination of a CNN, an RNN, and CTC (Connectionist Temporal Classification) loss for image-based sequence recognition tasks, as sketched below. Another direction is to experiment with different page segmentation modes of the Tesseract engine, such as ‘Treat the image as a single text line’ (--psm 7) or ‘Assume a single uniform block of text’ (--psm 6), since the mode can have a dramatic influence on the OCR output.
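
To illustrate the CRNN idea (this is not the architecture from the original paper; the layer sizes here are arbitrary), a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN sketch: CNN features -> bidirectional LSTM -> per-step scores."""

    def __init__(self, num_classes=27):  # e.g. 26 letters + 1 CTC blank (assumption)
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        # Two 2x2 poolings shrink a 32-pixel-high input to 8 feature rows.
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                      # x: (N, 1, 32, W) grayscale crops
        f = self.cnn(x)                        # (N, 128, 8, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (N, W/4, 1024): a feature sequence
        out, _ = self.rnn(f)                   # (N, W/4, 512)
        return self.fc(out)                    # per-timestep scores for CTC decoding

scores = TinyCRNN()(torch.randn(2, 1, 32, 128))  # -> (2, 32, 27)
```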

Further steps

Our OCR system follows the pipeline of text detection, then text recognition, then text correction. For text detection, we are going to train our object detection model using SSD (Single Shot Detector) on SVT, a dataset of street-view images.
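
If we build on torchvision again, a plausible starting point (an assumption on our part; torchvision 0.10+ ships an SSD300 with a VGG-16 backbone) would be:

```python
import torchvision

# Pretrained SSD300 (VGG-16 backbone) as a starting point before
# fine-tuning on the SVT street-view images.
ssd = torchvision.models.detection.ssd300_vgg16(pretrained=True)
```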

For text recognition, we are going to use CTC (Connectionist Temporal Classification) as our loss function, since it handles the alignment between input and output at the word level. Our OCR system's goal is to find a mapping from an image to a text sequence, a problem also called temporal classification. Because the image and the text sequence both have variable, unfixed lengths and are hard to align manually, CTC introduces a blank character to deal with this problem, and beam search is used at decoding time to find the optimal output.
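
For concreteness, here is how the CTC loss might be wired up in PyTorch; the shapes and the 27-class alphabet are illustrative, with index 0 reserved for the blank:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)

T, N, C, S = 32, 4, 27, 8        # time steps, batch, classes, target length
log_probs = torch.randn(T, N, C).log_softmax(2)          # stand-in model outputs
targets = torch.randint(1, C, (N, S), dtype=torch.long)  # labels, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```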

However, text recognition with CTC alone may not be enough to ensure the accuracy of the output, so we may need to incorporate BERT as an additional tool to correct the recognized text.

Reference

  1. OCR definition (Anyline): https://anyline.com/news/what-is-ocr/
  2. R-CNN, Fast R-CNN, Faster R-CNN, and YOLO overview: https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e
  3. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497, 2016. https://arxiv.org/abs/1506.01497
  4. Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, Jiajun Liang. EAST: An Efficient and Accurate Scene Text Detector. arXiv:1704.03155, 2017. https://arxiv.org/abs/1704.03155
  5. OpenCV EAST model and Tesseract for detection and recognition of text in natural scenes: https://jaafarbenabderrazak-info.medium.com/opencv-east-model-and-tesseract-for-detection-and-recognition-of-text-in-natural-scene-1fa48335c4d1
  6. Graves, A., Fernandez, S., Gomez, F., and Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376. DOI: 10.1145/1143844.1143891
  7. COCO dataset: https://cocodataset.org/#download
  8. COCO-Text API: https://github.com/andreasveit/coco-text
  9. Text detection with OpenCV: https://learnopencv.com/deep-learning-based-text-detection-using-opencv-c-python/
  10. Simple OCR with Tesseract: https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6
  11. OCR with Tesseract: https://nanonets.com/blog/ocr-with-tesseract/
