AI Case Study
Dropbox optimises document ranking and retrieval features with the use of machine learning
Dropbox has leveraged machine learning to develope a new full-text search engine to provide a more personalised experience to its users. The system, called Nautilus, consists of indexing and serving, two mostly-independent sub-systems. The former's role is to create a search index while the latter's to use it to return results according to the search. Thus, the system aims to better rank search results according to each user’s preferences.
Software And It Services
"Nautilus consists of two mostly-independent sub-systems: indexing and serving.
The role of the indexing pipeline is to process file and user activity, extract content and metadata out of it, and create a search index. The serving system then uses this search index to return a set of results in response to user queries. Together, these systems span several geographically-distributed Dropbox data centers, running tens of thousands of processes on more than a thousand physical hosts.
The simplest way to build an index would be to periodically iterate through all files in Dropbox, add them to the index, and then allow the serving system to answer requests. However, such a system wouldn’t be able to keep up with changes to documents in anything close to real-time, as we need to be able to do. So we follow a hybrid approach which is fairly common for search systems at large scale:
We generate “offline” builds of the search index on a regular basis (every 3 days, on average)
As users interact with files and each other, such as editing files or sharing them with other users, we generate “index mutations” that are then applied to both the live index and to a persistent document store, in near real-time (on the order of a few seconds).
Some other key pieces of the system that we’ll talk about are how to index different kinds of content, including using ML for document understanding and how to rank retrieved search results (including from other search indexes) using an ML-based ranking service.
What are the kinds of things users would like to search by? Of course there is the content of each document, i.e., the text in the file. But there are also numerous other types of data and metadata that are relevant.
We designed Nautilus to flexibly handle all of these and more, through the ability to define a set of “extractors” each of which extracts some sort of output from the input file and writes to a column in our “document store.” The underlying technology has extra custom built layers that provide access control and data encryption. It contains one row per file, with each column containing the output from a particular extractor. One significant advantage of this schema is that we can easily update multiple columns on a row in parallel without worrying about changes from one extractor interfering with those from others.
For most documents, we rely on Apache Tika to transform the original document into a canonical HTML representation, which then gets parsed in order to extract a list of “tokens” (i.e. words) and their “attributes” (i.e. formatting, position, etc…).
After we extract the tokens, we can augment the data in various ways using a “Doc Understanding” pipeline, which is well suited for experimenting with extraction of optional metadata and signals. As input it takes the data extracted from the document itself and outputs a set of additional data which we call “annotations.” Pluggable modules called “annotators” are in charge of generating the annotations. An example of a simple annotator is the stemming module which generates stemmed tokens based on raw tokens. Another example is converting tokens to embeddings for more flexible search.
As mentioned earlier, we tune our retrieval engine to return a large set of matching documents, without worrying too much about how relevant each document is to the user. The ranking step is where we focus on the opposite end of the spectrum: picking the documents that the user is most likely to want right now. (In technical terms, the retrieval engine is tuned for recall, while the ranker is tuned for precision.)
The ranking engine is powered by a ML model that outputs a score for each document based on a variety of signals. Some signals measure the relevance of the document to the query (e.g., BM25), while others measure the relevance of the document to the user at the current moment in time (e.g., who the user has been interacting with, or what types of files the user has been working on).
The model is trained using anonymized “click” data from our front-end, which excludes any personally identifiable data. Given searches in the past and which results were clicked on, we can learn general patterns of relevance. In addition, the model is retrained or updated frequently, adapting and learning from general users’ behaviors over time.
The main advantage of using an ML-based solution for ranking is that we can use a large number of signals, as well as deal with new signals automatically. For example, you could imagine manually defining an “importance” for each type of signal we have available to us, such as which documents the user interacted with recently, or how many times the document contains the search terms. This might be doable if you only have a handful of signals, but as you add tens or hundreds or even thousands of signals, this becomes impossible to do in an optimal way. This is exactly where ML shines: it can automatically learn the right set of “importance weights” to use for ranking documents, such that the most relevant ones are shown to the user. For example, by experimentation, we determined that freshness-related signals contribute significantly to more relevant results."
"Over the last few months, the Search Infrastructure engineering team at Dropbox has been busy releasing a new full-text search engine called Nautilus, as a replacement for our previous search engine.
Search presents a unique challenge when it comes to Dropbox due to our massive scale—with hundreds of billions of pieces of content—and also due to the need for providing a personalized search experience to each of our 500M+ registered users. It’s personalized in multiple ways: not only does each user have access to a different set of documents, but users also have different preferences and behaviors in how they search. This is in contrast to web search engines, where the focus on personalization is almost entirely on the latter aspect, but over a corpus of documents that are largely the same for each user (localities aside).
In addition, some of the content that we’re indexing for search changes quite often. For example, think about a user (or several users) working on a report or a presentation. They will save multiple versions over time, each of which might change the search terms that the document should be retrievable by.
More generally, we want to help users find the most relevant documents for a given query—at this particular moment in time—in the most efficient way possible. This requires being able to leverage machine intelligence at several stages in the search pipeline, from content-specific machine learning (such as image understanding systems) to learning systems that can better rank search results to suit each user’s preferences."
"After a period of qualification where Nautilus was running in shadow mode, it is currently the primary search engine at Dropbox. We’ve already seen significant improvements to the time-to-index new and updated content, and there’s much more to come."
"This requires being able to leverage machine intelligence at several stages in the search pipeline, from content-specific machine learning (such as image understanding systems) to learning systems that can better rank search results to suit each user’s preferences."
Users' Dropbox files