AI Case Study
Insurance company improves the previously manual mail sorting process with Dataiku's deep learning system
Dataiku has presented the details of a project it completed for a young insurance company. To relieve the client of manually sorting a large volume of incoming mail, Dataiku developed a deep learning system that recognises characters. The system was first trained to distinguish typed from handwritten letters, and then to detect words so that mail could be categorised by department. Although about one third of the mail still required manual sorting, the system improved on the existing process, handling letters at a rate of 1,000 per hour.
"At ODSC Europe 2018, Hubert detailed how his team created a fairly successful mail processing software for a young insurance company. The deep learning system successfully processed two-thirds of all mail it received at a 1,000 letter-per-hour rate. This marked an improvement over the third-party sorting service the company used before.
1. Separate Handwritten vs. Typed Letters:
Hubert’s team decided early on that their deep learning system should handle handwritten and typed letters differently. To do so, they had to label letters in a provided training dataset as handwritten, typed, or both, and to separate out anything that was not a letter (envelopes, forms, etc.). The team built a web platform and manually labeled about 2,000 documents into those four categories.
Starting from this labeled set, Hubert’s team needed to label the rest of the training data, which was still unlabeled. They used autoencoders: networks that take an input, such as an image, and are trained to reproduce it.
The network maps the input into an internal representation and then reconstructs the input from it. That representation is called the latent space, and it holds the features needed to reconstruct an image quickly and efficiently.
Hubert fed the 2,000 labeled images through the autoencoder and paired their manual labels with the resulting latent representations. This made it easy for a traditional machine learning algorithm to label all remaining images in the dataset. The model built on the latent space achieved an AUC of 97 percent with very few errors, meaning it distinguished handwritten from typed letters very effectively.
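To make the latent-space idea concrete, here is a minimal sketch of an autoencoder, not the team's model. It assumes a purely linear encoder and decoder trained by plain gradient descent on tiny six-pixel "images"; the toy data, the function names and the hyperparameters are all invented for illustration. A real system would use a deeper network on actual scans and pass the latent vectors to a separate classifier.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def train_autoencoder(data, latent_dim=2, lr=0.1, steps=500, seed=0):
    """Fit a linear autoencoder by gradient descent on mean squared error."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(d)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(latent_dim)]
    losses = []
    for _ in range(steps):
        latent = matmul(data, enc)   # encode: n x latent_dim
        recon = matmul(latent, dec)  # decode: n x d
        err = [[r - x for r, x in zip(rr, xr)] for rr, xr in zip(recon, data)]
        losses.append(sum(e * e for row in err for e in row) / (n * d))
        # Gradients of the mean squared reconstruction error.
        grad_out = [[2 * e / (n * d) for e in row] for row in err]
        grad_dec = matmul(transpose(latent), grad_out)
        grad_enc = matmul(transpose(data), matmul(grad_out, transpose(dec)))
        enc = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(enc, grad_enc)]
        dec = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(dec, grad_dec)]
    return enc, dec, losses

# Hypothetical toy "images": two pixel patterns standing in for two letter types.
typed = [1, 1, 1, 0, 0, 0]
handwritten = [0, 0, 0, 1, 1, 1]
images = [typed, handwritten] * 4

enc, dec, losses = train_autoencoder(images)
latent_vectors = matmul(images, enc)  # features a downstream classifier could use
print(f"reconstruction error: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The latent vectors are much shorter than the inputs, which is what makes them convenient features for a conventional classifier trained on the manually labeled subset.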
2. Deal with Typed Letters:
Hubert said dealing with typed letters was the easiest part of creating the mail processing system. Using a tool called the Tesseract Open Source Optical Character Recognition Engine, the team simply fed in the images and specified their language. Tesseract returned the fully digitized text.
The Tesseract tool isn’t perfect. For instance, when it tries to parse signatures it produces wild and inaccurate characters. Overall, though, the tool was very effective for Hubert’s team: digitizing the words made sorting a trivial problem. To sort, the team simply used the letters’ term frequency-inverse document frequency (tf-idf) scores. Running a logistic regression on these scores achieved good sorting results. Typed letters, then, could be forwarded to the proper departments with relative ease.
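As an illustration of the tf-idf weighting mentioned above, the sketch below computes the scores by hand for three invented letters. It assumes the plain textbook formula (term frequency times the log of inverse document frequency) and whitespace tokenization; the team's exact variant is not stated, and the function name `tf_idf` and the sample texts are hypothetical. The classifier step (logistic regression) is left out.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Invented sample letters, already digitized.
letters = [
    "please cancel my car insurance contract",
    "insurance claim for car accident damage",
    "change of address for my home insurance contract",
]
weights = tf_idf(letters)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Words that appear in every letter, such as "insurance" here, get a weight of zero, while words specific to one letter, such as "cancel", score highest. That is why these scores work well as inputs to a department classifier.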
3. Detecting Words in Handwritten Letters:
In contrast, detecting words in handwritten letters is fairly difficult.
The deep learning network had to extract all words and find a way to digitize them.
The team started by narrowing the scope of what the deep learning network was asked to read. Body paragraphs, they decided, were the only section it really needed to read to identify a letter’s topic. So Hubert and his team used computer vision and decomposition techniques to find the body paragraphs of the letters.
First, they used dilation by convolution to recognize the letters’ general layout, applying a cross-shaped dilation kernel. Then, they applied connected-component techniques. The process revealed that most written letters share the same structure, which makes the body text of a letter easy to identify (highlighted in red in the original figure).
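The layout step can be illustrated on a toy page. The sketch below is not the team's code: it applies a cross-shaped dilation directly to a small binary grid (rather than through a convolution library) and then extracts 4-connected regions. The page contents, the function names `dilate` and `connected_components`, and the rule of taking the largest region as the body are all assumptions made for illustration.

```python
from collections import deque

# Cross-shaped structuring element: the pixel itself plus its four neighbours.
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

def dilate(grid, offsets=CROSS):
    """Grow every ink pixel by the structuring element."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c]:
                for dr, dc in offsets:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[rr][cc] = 1
    return out

def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected regions."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                queue = deque([(r, c)])
                seen[r][c] = True
                top, left, bottom, right = r, c, r, c
                while queue:
                    y, x = queue.popleft()
                    top, left = min(top, y), min(left, x)
                    bottom, right = max(bottom, y), max(right, x)
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes

# Hypothetical page: a short header line, a gap, then two body lines.
page = [
    "..........",
    ".#.#......",
    "..........",
    "..........",
    "..........",
    ".#.#.#.#..",
    "..........",
    ".#.#.#.#..",
    "..........",
]
binary = [[1 if ch == "#" else 0 for ch in row] for row in page]

boxes = connected_components(dilate(binary))
# Assumption for this sketch: the largest merged block is the body text.
body = max(boxes, key=lambda b: (b[2] - b[0] + 1) * (b[3] - b[1] + 1))
print(boxes, body)
```

Before dilation every ink pixel is its own component; after one pass the neighbouring marks merge into two blocks, a header and a body, which is the effect the team relied on to expose the page layout.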
After defining the desired area of the image, they needed to identify the lines of text, and then each individual word in each line. This process mimics human reading patterns.
To separate individual words, they looked for the white space between lines and between words. They used the projection profile method, which sums the pixels along a given axis of the image. Where there is white space, the sum should be close to zero; where something is written, the sum should be quite large.
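A minimal sketch of the projection profile idea on a toy binary image follows. The grid, the function names and the zero threshold are assumptions for illustration. Real scans are noisy, so a practical version would need a tolerance and a minimum gap width to avoid splitting a word between its letters.

```python
def projection_profile(grid, axis):
    """Sum ink pixels per row (axis="rows") or per column (axis="cols")."""
    if axis == "rows":
        return [sum(row) for row in grid]
    return [sum(col) for col in zip(*grid)]

def split_on_gaps(profile, threshold=0):
    """Return (start, end) index ranges where the profile exceeds the threshold."""
    segments, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments

# Hypothetical body region: two text lines, each containing two "words".
body_rows = [
    "##.###",
    "##.###",
    "......",
    "###.##",
]
grid = [[1 if ch == "#" else 0 for ch in row] for row in body_rows]

# Step 1: find text lines from the row sums.
lines = split_on_gaps(projection_profile(grid, "rows"))

# Step 2: within each line, find words from the column sums.
words = []
for top, bottom in lines:
    line_grid = grid[top:bottom + 1]
    words.append(split_on_gaps(projection_profile(line_grid, "cols")))

print(lines)  # row ranges of each text line
print(words)  # column ranges of each word, per line
```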
4. Extracting Words from Images:
Deep learning can turn images of words into digitized text, which natural language processing techniques can then read and organize.
Hubert’s team did a small bit of labeling themselves and augmented those images to inflate the dataset, but needed more training data to build a truly robust deep learning tool. So they combined their labeled data with the IAM database online, which contains more than 100,000 handwritten words correctly labeled. Then they added word images made from various fonts similar to human handwriting.
Using a combination of two popular deep learning architectures, CNN and LSTM, the team trained a network stack that takes an image of a word as input and produces the digitized word. The CNN captures the information in the image: it learns the visual features that indicate a sequence of letters. The LSTM reads the features identified by the CNN and translates them into a sequence of characters.
Based on the digitized words, the team was again able to use tf-idf to find the topic of the letters and identify what department each should be sent to."
"Of all letters scanned into the system, the deep learning system sorted 78 percent automatically, and 90 percent of those were sent to the proper department. In the remaining 22 percent of cases the system couldn’t identify a prevailing topic, so the letter had to be sorted manually, along with the 10 percent of automatically sorted letters that were routed to the wrong department.
In the end, about a third of letters still had to be sorted manually at the insurance company. Overall, though, the process was much quicker than having the third-party mail sorter do it all. Since then, the team has also implemented curriculum learning."
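The "about a third" figure follows from the two percentages quoted above. A quick check using only those numbers (the variable names are illustrative):

```python
auto_sorted = 0.78           # share of letters the system chose to sort
correct_given_sorted = 0.90  # share of those sent to the right department

correctly_routed = auto_sorted * correct_given_sorted   # handled end to end
not_sorted = 1 - auto_sorted                            # no prevailing topic found
misrouted = auto_sorted * (1 - correct_given_sorted)    # sorted, but wrongly
manual = not_sorted + misrouted                         # needs a human

print(f"correctly routed: {correctly_routed:.1%}")
print(f"manual handling:  {manual:.1%}")
```

Roughly 70 percent of letters are routed correctly without human involvement and just under 30 percent need manual handling, which matches the "two-thirds" and "about a third" figures in the text.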
Dataiku's client was a young insurance company (10 years old) with 200 employees. The company receives 800 to 2,000 letters per day and had outsourced mail sorting at a cost of 100K per year. (Source: YouTube presentation.)
300,000 labeled word images