AI Case Study
Dropbox's in-house OCR system for document scanning outperforms commercial libraries by using neural networks
Dropbox replaced its commercial optical character recognition (OCR) platform with an in-house system to support the document scanning feature in its app. The new system, now in production, uses computer vision, convolutional neural networks (CNNs), and long short-term memory (LSTM) networks.
Internet Services Consumer
"The last few years has seen the successful application of deep learning to numerous problems in computer vision that have given us powerful new tools for tackling OCR without having to replicate the complex processing pipelines of the past, relying instead on large quantities of data to have the system automatically learn how to do many of the previously manually-designed steps." Dropbox's mobile document scanner app uses "computer vision and deep learning advances such as bi-directional Long Short Term Memory (LSTMs), Connectionist Temporal Classification (CTC), convolutional neural nets (CNNs), and more."
Optical character recognition (OCR) extracts text from images of documents so that the text can be searched and copied. "Traditionally, OCR systems were heavily pipelined, with hand-built and highly-tuned modules taking advantage of all kinds of conditions they could assume to be true for images captured using a flatbed scanner... The process to build these OCR systems was very specialized and labor intensive, and the systems could generally only work with fairly constrained imagery from flat bed scanners."
"In all, this entire round of researching, productionization, and refinement took about 8 months, at the end of which we had built and deployed a state-of-the-art OCR pipeline to millions of users using modern computer vision and deep neural network techniques."
"The Word Deep Net combines neural network architectures used in computer vision and automatic speech recognition systems. Images of cropped words are fed into a Convolutional Neural Net (CNN) with several convolutional layers. The visual features that are output by the CNN are then fed as a sequence to a Bidirectional LSTM (Long Short Term Memory) — common in speech recognition systems — which make sense of our word “pieces,” and finally arrives at a text prediction using a Connectionist Temporal Classification (CTC) layer. Batch Normalization is used where appropriate.
For our Word Detector we decided to not use a deep net-based approach. We ended up using a classic computer vision approach named Maximally Stable Extremal Regions (MSERs), using OpenCV’s implementation. The MSER algorithm finds connected regions at different thresholds, or levels, of the image. Essentially, they detect blobs in images, and are thus particularly good for text. Our Word Detector first detects MSER features in an image, then strings these together into word and line detections. One tricky aspect is that our word deep net accepts fixed size word image inputs. This requires the word detector to thus sometimes include more than one word in a single detection box, or chop a single word in half if it is too long to fit the deep net’s input size.
Once we had refined our Word Detector to an acceptable point, we chained it together with our Word Deep Net."
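Two glue steps in the pipeline quoted above lend themselves to a short sketch: fitting detected boxes to a fixed input width, and turning per-frame CTC output into text. The Python below is a minimal illustration under stated assumptions, not Dropbox's implementation. The function names `fit_boxes_to_width` and `ctc_greedy_decode`, the `(x, y, w, h)` box layout, and the choice of index 0 for the CTC blank symbol are all hypothetical. The splitter simply cuts an over-wide box into consecutive chunks and does not model merging several short words into one box. The decoder uses greedy (best-path) decoding, which is one common way to read a CTC output.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC "blank" symbol in this sketch


def fit_boxes_to_width(word_boxes, max_width):
    """Cut any box wider than max_width into consecutive chunks.

    Each box is an (x, y, w, h) tuple in pixels. Boxes that already fit
    are passed through unchanged.
    """
    fitted = []
    for x, y, w, h in word_boxes:
        for offset in range(0, w, max_width):
            fitted.append((x + offset, y, min(max_width, w - offset), h))
    return fitted


def ctc_greedy_decode(frame_scores, alphabet):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks.

    frame_scores holds one list of scores per time step; position 0 is the
    blank and position i (i >= 1) corresponds to alphabet[i - 1].
    """
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    collapsed = [label for label, _ in groupby(best_path)]
    return "".join(alphabet[label - 1] for label in collapsed if label != BLANK)
```

With the alphabet "act", a best path of c, c, blank, a, a, blank, t decodes to "cat". The blank is what lets CTC represent doubled letters: a, blank, a decodes to "aa", while a, a collapses to a single "a".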
Data annotation was done manually. Dropbox used a "word-level dataset, which has images of individual words and their annotated text, as well as a full document-level dataset, which has images of full documents (like receipts) and fully transcribed text. We used the latter to measure the accuracy of existing state-of-the-art OCR systems; this would then inform our efforts by telling us the score we would have to meet or beat for our own system."
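The idea of a "score we would have to meet or beat" can be illustrated with a small scoring sketch. The case study does not say which metric Dropbox used, so the character-level accuracy below (one minus edit distance divided by reference length) is purely an assumption, and the names `edit_distance`, `document_accuracy`, and `meets_baseline` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # delete char_a
                    current[j - 1] + 1,                    # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def document_accuracy(predicted, reference):
    """Character-level accuracy in [0, 1] against a transcribed reference."""
    if not reference:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, reference) / len(reference))


def meets_baseline(our_score, baseline_score):
    """True when our system's score matches or beats the baseline's."""
    return our_score >= baseline_score
```

Running an existing OCR system and a candidate system over the same document-level dataset and averaging `document_accuracy` for each would give two numbers to compare with `meets_baseline`.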
Dropbox also generated a million-word synthetic training dataset, drawing vocabulary from Project Gutenberg, rendering it in 2,000 fonts, and applying some distortion.
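A synthetic dataset of this kind can be thought of as a stream of (word, font, distortion) combinations that a renderer turns into images. The sketch below only samples those combinations and does not draw anything. The names `SampleSpec` and `generate_specs` are hypothetical, and the specific distortions (a small rotation and a blur radius) and their ranges are assumptions, since the case study says only "some distortion".

```python
import random
from typing import NamedTuple


class SampleSpec(NamedTuple):
    """Recipe for one synthetic word image (rendering is out of scope here)."""
    text: str
    font: str
    rotation_degrees: float  # assumed distortion
    blur_radius: float       # assumed distortion


def generate_specs(vocabulary, fonts, count, seed=0):
    """Sample `count` recipes; a fixed seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [
        SampleSpec(
            text=rng.choice(vocabulary),
            font=rng.choice(fonts),
            rotation_degrees=rng.uniform(-5.0, 5.0),
            blur_radius=rng.uniform(0.0, 1.5),
        )
        for _ in range(count)
    ]
```

For example, `generate_specs(["receipt", "total", "invoice"], ["FontA", "FontB"], count=5)` returns five recipes; the font names here are placeholders for real font files.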