Optical Character Recognition System in Natural Scenes — Part 2
Work conducted by: Haoda Song, Siyuan Li, Enmin Zhou, Yangyin Ke, Huaqi Nie
Introduction
In the previous blog, we set up our two baseline models for text detection and text recognition.
In the current stage, we researched and implemented different architectures for the two steps. As a result, the Single Shot Detector 512 with a MobileNetV2 base performed best on the Google Street View dataset and the IIIT 5K word-only images.
Text Detection:
Single Shot Detector (SSD)
SSD was introduced in 2016 by Wei Liu and other researchers in the paper SSD: Single Shot MultiBox Detector. It uses a VGG-16 model pre-trained on ImageNet as the base network and adds multiple convolutional layers of decreasing size on top. VGG-16 extracts useful image features, and object detection then happens at each of these convolutional layers. The decreasing feature-map sizes allow objects of various scales, both large and small, to be captured well.
The main challenges of scene text detection lie in the arbitrary orientations, small sizes, and widely varying aspect ratios of text in natural images. In the TextBoxes++ paper, this SSD-based application is shown to perform efficiently and accurately compared to models from the YOLO and R-CNN families.
Compared to the baseline Faster R-CNN model, SSD is much faster while keeping similar accuracy: SSD and YOLO are regression-based object detectors, whereas R-CNN models handle detection as a classification problem, building a pipeline in which object proposals are generated first and then sent to classification/regression heads. And while YOLO prioritizes extreme speed, SSD balances the trade-off between speed and accuracy. After implementing the different models (Faster R-CNN, YOLO, SSD) pre-trained on large datasets such as MS-COCO, we found that SSD predicted the most accurate bounding boxes for scene text in the Google Street View dataset.
SSD 512 x 512 with MobileNetV2
The SSD model was pre-trained on images with clearly visible text and fine-tuned on the Google Street View Text (SVT) dataset.
The SVT images were resized to 512 x 512 to preserve their information, and the loss was customized following the YOLO paper, since the SSD loss has the same form as the YOLO loss: the sum of a localization loss and a classification loss. The main change is the base network: VGG-16 was replaced by MobileNetV2 because of its high accuracy and much smaller number of weights.
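To make that loss term concrete, below is a minimal sketch of an SSD-style multibox loss in TensorFlow, combining a classification loss over anchors with a smooth-L1 (Huber) localization loss over positive anchors. The tensor layout, the function name, and the omission of hard-negative mining are simplifying assumptions, not our exact training code.

```python
import tensorflow as tf

def multibox_loss(cls_true, cls_pred, loc_true, loc_pred, alpha=1.0):
    """SSD-style loss: classification cross-entropy + smooth-L1 localization.

    cls_true: (batch, num_anchors) integer class ids, 0 = background.
    cls_pred: (batch, num_anchors, num_classes) raw logits.
    loc_true, loc_pred: (batch, num_anchors, 4) box offsets.
    """
    positive = tf.cast(cls_true > 0, tf.float32)         # anchors matched to text boxes
    num_pos = tf.maximum(tf.reduce_sum(positive), 1.0)    # avoid division by zero

    # Classification loss over all anchors (hard-negative mining omitted in this sketch).
    cls_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=cls_true, logits=cls_pred)
    cls_loss = tf.reduce_sum(cls_loss)

    # Smooth-L1 (Huber) localization loss, counted only for positive anchors.
    huber = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.NONE)
    loc_loss = huber(loc_true, loc_pred)                  # (batch, num_anchors)
    loc_loss = tf.reduce_sum(loc_loss * positive)

    return (cls_loss + alpha * loc_loss) / num_pos
```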
Results
We froze the layers of the MobileNetV2 base and made the added convolutional layers trainable so the model could best detect text in SVT images. After training for 200 epochs, the predictions changed noticeably:

As the left images show, the pre-trained SSD model without fine-tuning detects text inaccurately: windows, for example, might be detected as “lll”, and some text is missed, as the comparison of the lower two images shows. One significant improvement is that the fine-tuned model captures small and blurry text in natural street images much better.
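The fine-tuning setup described above (frozen MobileNetV2 base, trainable detection layers, 200 epochs) might look roughly like the Keras sketch below. The names `ssd_model`, `ssd_loss`, `svt_train_ds`, and `svt_val_ds`, as well as the layer-name prefix, are placeholder assumptions rather than our exact code.

```python
import tensorflow as tf

# Sketch only: `ssd_model` is assumed to be a Keras SSD-512 whose MobileNetV2
# backbone layers share a "mobilenetv2" name prefix, and `ssd_loss` is the
# combined localization + classification loss; both names are placeholders.
for layer in ssd_model.layers:
    layer.trainable = not layer.name.startswith("mobilenetv2")  # freeze the base only

ssd_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss=ssd_loss)

# `svt_train_ds` / `svt_val_ds` stand in for tf.data pipelines of resized SVT images.
ssd_model.fit(svt_train_ds, validation_data=svt_val_ds, epochs=200)
```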
Future Improvements
Although the fine-tuned SSD model improves performance compared to the baseline and other models, handwritten-style text and rotated text are still rarely detected, as the images above display. Approaches for potential improvements will be shown in detail in the next blog.
Text Recognition:
Convolutional Recurrent Neural Network (CRNN)
Convolutional Recurrent Neural Network is a neural network architecture that integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). Unlike many earlier deep learning approaches to text recognition, a CRNN can take input images of different dimensions and return results of different lengths. We can feed coarse word-level labels directly into training without detailed character-level annotations. CRNN also avoids the fully connected layers widely used in CNNs, which makes the model much more compact and efficient. Given these properties, we implement the CRNN methodology on the IIIT 5K word recognition dataset to check text recognition performance.
The following is a graph showing the CRNN model architecture.
Model Performance: In our experiment, we built the CRNN model architecture based on the paper An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition. The following two figures show the detailed architecture of our designed model.
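As a reference for the structure, here is a minimal Keras sketch of a CRNN in the spirit of that paper: a convolutional feature extractor, a map-to-sequence step that treats the image width as time steps, stacked bidirectional LSTMs, and a per-timestep softmax. The filter counts and the 32 x 100 input size are illustrative assumptions rather than our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(num_classes, img_height=32, img_width=100):
    """Minimal CRNN sketch: CNN feature extractor -> BiLSTM -> per-timestep softmax.
    Trained with a CTC loss so the label length can differ from the sequence length."""
    inputs = layers.Input(shape=(img_height, img_width, 1))
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)                    # 16 x 50
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)                    # 8 x 25
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)     # 4 x 25: keep width as time steps

    # Map-to-sequence: treat the width axis as the time axis for the recurrent layers.
    x = layers.Permute((2, 1, 3))(x)                 # (width, height, channels)
    x = layers.Reshape((25, 4 * 256))(x)

    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    outputs = layers.Dense(num_classes + 1, activation="softmax")(x)  # +1 for CTC blank
    return tf.keras.Model(inputs, outputs)
```

Training pairs this per-timestep output with a CTC loss, which is what lets the model return words of different lengths without character-level alignment.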
Result:
From a general perspective, the CRNN model shows a satisfying loss curve in the above plot. Both training and validation loss curves show a clear decreasing trend during the first 10 epochs. However, the model shows a tendency to overfit after about 12 epochs. In summary, the best validation loss we reached was 6.0541. To obtain a clearer understanding of how the model recognizes the image data, we printed several sample prediction texts together with the original images.
Discussion and Future Improvements: From the above plot, we can see that our model performs well when the input images contain printed text. However, it does poorly when the text in an image uses an artistic or unusual font. We trained the model on 3,000 images and used 500 images for validation, so we hope to train on more images to improve the model's ability to capture unique-style characters. Another direction comes from the observation that some predictions contain only minor spelling errors; we would like to add a proof-check step after the model returns its result so that the final text prediction is more precise.
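One simple way to realize this proof-check idea is to snap a recognized word to the closest entry in a vocabulary by string similarity, as in the sketch below. The vocabulary, the cutoff value, and the helper name are hypothetical choices for illustration, not a committed design.

```python
from difflib import get_close_matches

def proof_check(prediction, vocabulary, cutoff=0.8):
    """Snap a recognized word to the closest vocabulary entry if it is similar enough.
    `vocabulary` is an assumed word list (e.g. a lexicon of expected sign words)."""
    matches = get_close_matches(prediction.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else prediction

# Example: a minor spelling error gets corrected, an unknown token is left alone.
vocab = ["restaurant", "hotel", "private", "parking"]
print(proof_check("restaurnat", vocab))  # -> "restaurant"
print(proof_check("xyz123", vocab))      # -> "xyz123"
```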
Attention-based OCR
The model first runs a sliding CNN on the image (images are resized to height 32 while preserving aspect ratio). Then an LSTM is stacked on top of the CNN. Finally, an attention model is used as a decoder for producing the final outputs.
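For intuition, here is a minimal NumPy sketch of one step of additive (Bahdanau-style) attention over the encoder features: the decoder scores every encoder position, normalizes the scores with a softmax, and takes a weighted sum as the context for the next output character. The dimensions and weight matrices are illustrative assumptions, not the exact parameterization of the attention-OCR implementation.

```python
import numpy as np

def additive_attention(encoder_states, decoder_state, W_enc, W_dec, v):
    """One decoding step of additive attention.

    encoder_states: (T, d_enc) features from the sliding CNN + LSTM encoder.
    decoder_state:  (d_dec,)   current decoder hidden state.
    Returns the attention weights over the T positions and the context vector."""
    scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over encoder positions
    context = weights @ encoder_states          # (d_enc,) weighted sum of features
    return weights, context

# Toy dimensions, for illustration only.
T, d_enc, d_dec, d_att = 20, 256, 128, 64
rng = np.random.default_rng(0)
weights, context = additive_attention(
    rng.normal(size=(T, d_enc)), rng.normal(size=d_dec),
    rng.normal(size=(d_enc, d_att)), rng.normal(size=(d_dec, d_att)),
    rng.normal(size=d_att))
```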
Results: The loss and perplexity of the model decreased substantially from the start. This model was trained on 2,000 images from the IIIT 5K dataset and tested on 3,000 images from the same dataset. The prediction accuracy is 0.61 (a prediction counts as correct only if the entire word is predicted exactly right). This model performs better than the plain CRNN above, as it uses an attention mechanism in the decoder.
Below, we show two predictions that reflect different attention over the same input image. The attention still needs to be fine-tuned to output the correct predictions.
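The whole-word accuracy we report can be computed with a small helper like the sketch below; treating comparison as case-insensitive is an assumption on our side.

```python
def word_accuracy(predictions, ground_truths):
    """Whole-word accuracy: a prediction is correct only if the entire word
    matches the ground-truth text exactly (compared case-insensitively)."""
    correct = sum(p.lower() == g.lower() for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# e.g. word_accuracy(["HOTEL", "privte"], ["hotel", "private"]) -> 0.5
```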
Next Steps:
- Improve the text detection part: the pre-trained SSD model is not good at detecting content that takes up only a small portion of the image. In other words, objects that are blurry due to distance or size will be misclassified or ignored. Furthermore, text that is crooked or inclined still poses a challenge to the existing model.
- Improve the text recognition: to improve text recognition, we may combine models such as attention OCR with our self-trained CRNN.
- Incorporate our model into a mobile application: we have a demo using ML Kit, which is based on the Firebase service. In the same way, we will set up our own model on Firebase. For the model to be adapted to mobile platforms, we will compress the weights into TensorFlow Lite form, which requires less power to host on small devices (see the sketch below).
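The conversion step could look roughly like the following. This is a minimal sketch assuming a trained Keras model (the name `trained_model` is a placeholder), using TensorFlow Lite's standard converter with default weight quantization; the resulting file is the kind of custom model Firebase ML Kit can host.

```python
import tensorflow as tf

# Sketch: convert a trained Keras model (placeholder name `trained_model`) into a
# compressed TensorFlow Lite model suitable for hosting as a Firebase custom model.
converter = tf.lite.TFLiteConverter.from_keras_model(trained_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # weight quantization to shrink size
tflite_bytes = converter.convert()

with open("ocr_model.tflite", "wb") as f:
    f.write(tflite_bytes)
```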
References:
- Text Detection:
  - SSD: Single Shot MultiBox Detector: https://arxiv.org/abs/1512.02325
  - https://lilianweng.github.io/lil-log/2018/12/27/object-detection-part-4.html#ssd-single-shot-multibox-detector
  - TextBoxes++: https://arxiv.org/abs/1801.02765
  - https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd/
- Text Recognition:
  - https://github.com/emedvedev/attention-ocr
  - CRNN: https://arxiv.org/pdf/1507.05717.pdf
  - https://github.com/FLming/CRNN.tf2
  - https://www.kaggle.com/samfc10/handwriting-recognition-using-crnn-in-keras
  - https://github.com/MaybeShewill-CV/CRNN_Tensorflow
- ML-Kit:
  - https://firebase.google.com/docs/ml-kit/android/use-custom-models