AI Case Study
Facebook improves fast, accurate translations for more languages using unsupervised machine learning
Researchers at Facebook AI Research developed a machine translation system based on unsupervised learning to improve translation accuracy between languages, including those with little or no parallel training data. The system starts from a bilingual dictionary, inferred without supervision, that associates words with their plausible translations in the other language. It is a significant advance over earlier unsupervised approaches and performs about as well as a supervised system trained on roughly 100,000 reference translations.
Internet Services Consumer
"In our research, we identified three steps — word-by-word initialization, language modeling, and back translation — as important principles for unsupervised MT. Equipped with these principles, we can derive various models. We applied them to two very different methods to tackle our goal of unsupervised MT.
The first one was an unsupervised neural model that was more fluent than word-by-word translations but did not produce translations of the quality we wanted. They were, however, good enough to be used as back-translation sentences. With back translation, this method performed about as well as a supervised model with 100,000 parallel sentences.
Next, we applied the principles to another model based on classical count-based statistical methods, dubbed phrase-based MT. These models tend to perform better on low-resource language pairs, which made it particularly interesting, but this is the first time this method has been applied to unsupervised MT. In this case, we found that the translations had the correct words but were less fluent. Again, this method outperformed previous state-of-the-art unsupervised models.
Finally, we combined both models to get the best of both worlds: a model that is both fluent and good at translating. To do this, we started from a trained neural model and then trained it with additional back-translated sentences from the phrase-based model.
Empirically, we found that this last combined approach dramatically improved accuracy over the previous state-of-the-art unsupervised MT — showing an improvement of more than 10 BLEU points on English-French and English-German, two language pairs that have been used as a test bed (and even for these language pairs, there is no use of any parallel data at training time — only at test time, to evaluate).
We also tested our methods on distant language pairs like English-Russian; on low-resource languages like English-Romanian; and on an extremely low-resource and distant language pair, English-Urdu. In all cases, our method greatly improved over other unsupervised approaches, and sometimes even over supervised approaches that use parallel data from other domains or from other languages."
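The word-by-word initialization and back-translation steps described in the quote above can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the researchers' implementation: the dictionaries `EN_TO_FR` and `FR_TO_EN` are hand-written here (in the research they are inferred without supervision), and the helper names `word_by_word` and `back_translate` are hypothetical. The language-modeling step is omitted.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy dictionary, written by hand for illustration only.
EN_TO_FR: Dict[str, str] = {"the": "le", "cat": "chat", "sleeps": "dort"}
FR_TO_EN: Dict[str, str] = {fr: en for en, fr in EN_TO_FR.items()}


def word_by_word(sentence: str, dictionary: Dict[str, str]) -> str:
    """Translate each token independently; unknown tokens are copied unchanged."""
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())


def back_translate(
    target_monolingual: List[str],
    target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-language text only.

    The source side is machine-generated and may be noisy; the target side is
    genuine human-written text, so a source-to-target model can train on it.
    """
    return [(target_to_source(t), t) for t in target_monolingual]


# Monolingual French text is turned into synthetic English-French pairs.
french_corpus = ["le chat dort"]
pairs = back_translate(french_corpus, lambda s: word_by_word(s, FR_TO_EN))
print(pairs)
```

In an iterative setup, a model trained on such pairs would replace the word-by-word translator in the next round, so the synthetic source sentences improve as training proceeds.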
R And D
"Automatic language translation is important to Facebook as a way to allow the billions of people who use our services to connect and communicate in their preferred language. To do this well, current machine translation (MT) systems require access to a considerable volume of translated text (e.g., pairs of the same text in both English and Spanish). As a result, MT currently works well only for the small subset of languages for which a volume of translations is readily available."
"Our new approach provides a dramatic improvement over previous state-of-the-art unsupervised approaches and is equivalent to supervised approaches trained with nearly 100,000 reference translations. To give some idea of the level of advancement, an improvement of 1 BLEU point (a common metric for judging the accuracy of MT) is considered a remarkable achievement in this field; our methods showed an improvement of more than 10 BLEU points.
This is an important finding for MT in general and especially for the majority of the 6,500 languages in the world for which the pool of available translation training resources is either nonexistent or so small that it cannot be used with existing systems. For low-resource languages, there is now a way to learn to translate between, say, Urdu and English by having access only to text in English and completely unrelated text in Urdu – without having any of the respective translations."
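To make the BLEU figures above more concrete, here is a minimal Python sketch of a simplified, sentence-level BLEU score against a single reference. It is an assumption-laden illustration: published results normally use corpus-level BLEU with standard tokenization, and the function names `ngrams` and `bleu` are hypothetical. Scores are on a 0 to 100 scale, so "10 BLEU points" means a difference of 10 on that scale.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU (0-100) with one reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100.0 * brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))
print(bleu("the cat sat on a mat", reference))
```

The score is the geometric mean of the 1-gram to 4-gram precisions, scaled by the brevity penalty, so a translation that gets the right words in the wrong order is penalized through the higher-order n-grams.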
"For English, French, German and Russian, we use all available sentences from the WMT monolingual News Crawl datasets from years 2007 through 2017. For Romanian, the News Crawl dataset is only composed of 2.2 million sentences, so we augment it with the monolingual data from WMT’16, resulting in 2.9 million sentences. In Urdu, we use the dataset of Jawaid et al. (2014), composed of about 5.5 million monolingual sentences." (paper)