AI Case Study
Facebook uses machine learning to reduce the time needed to support new queries in its internal reactive cache from weeks to minutes
Facebook has adopted machine learning to learn the effects of new queries on its internal caching system. This replaces a heuristics-based engineering approach in which engineers had to test each new query manually to determine its effects.
Internet Services Consumer
"The Facebook codebase is pushed to production every few hours — for example, new versions of the front end — as part of our continuous deployment process. In this dynamic world, trying to manually fine-tune services to maintain peak efficiency is impractical. It is simply too difficult to rewrite caching/admission/eviction policies and other manually tuned heuristics by hand. We have to fundamentally change how we think about software maintenance.
To efficiently address this challenge, the system needed to become self-tuning rather than rely on manually hard-coded heuristics and parameters. This shift prompted Facebook engineers to approach work in a new way: Instead of looking at charts and logs produced by the system to verify correct and efficient operation, engineers now express what it means for a system to operate correctly and efficiently in code. Today, rather than specify how to compute correct responses to requests, our engineers encode the means of providing feedback to a self-tuning system.
To more effectively optimize our many services, with the flexibility to adapt to a constantly changing interconnected web of internal services, we have developed Spiral. Spiral is a system for self-tuning high-performance infrastructure services at Facebook scale, using techniques that leverage real-time machine learning."
"Unlike hard-coded heuristics, Spiral-based heuristics can adapt to changing conditions. In the case of a cache admission policy, for example, if certain types of items are requested less frequently, the feedback will retrain the classifier to reduce the likelihood of admitting such items without any need for human intervention.
With a Spiral-based cache invalidation mechanism, the time required to support a new query in the reactive cache came down from weeks to minutes. Before Spiral, reactive cache engineers had to inspect each new query’s side effects by running experiments and collecting data manually. With Spiral, however, most use cases (mapping to a query) are learned by the local model automatically within minutes, so the local inference is available immediately."
"Spiral uses machine learning to create data-driven and reactive heuristics for resource-constrained real-time services. The system allows for much faster development and hands-free maintenance of those services, compared with the hand-coded alternative. Integration with Spiral consists of adding just two call sites to your code: one for prediction and one for feedback. The prediction call site is the output of the smart heuristic used to make decisions, such as “Should this item be admitted into the cache?” The prediction call is implemented as a fast local computation and is meant to be executed on every decision."
"The feedback call site is for providing occasional feedback, such as “This item expired from the cache without ever being hit, so we should probably not cache items like this one.” In Spiral, learning starts as soon as feedback comes in. Prediction quality improves progressively as more feedback is generated. In most services, feedback is available within seconds to minutes, so the development cycle is very short. Domain experts can add a new feature and see within minutes whether it is helping to improve the quality of predictions."
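The two call sites described above can be illustrated with a minimal sketch. It is not Spiral's API: the class `OnlineClassifier`, the helper `should_admit`, the feature layout, and the use of online logistic regression are all assumptions made for illustration. It only shows the shape of the integration: a cheap prediction on every decision, and an occasional feedback call that updates the model immediately.

```python
import math


class OnlineClassifier:
    """Tiny online logistic regression (hypothetical stand-in for a learned heuristic)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Prediction call site: a fast local computation, run on every decision."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def feedback(self, features, label):
        """Feedback call site: one gradient step on the log loss, applied right away."""
        error = self.predict(features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


def should_admit(model, features, threshold=0.5):
    """Answer 'Should this item be admitted into the cache?' from the model's score."""
    return model.predict(features) >= threshold


if __name__ == "__main__":
    model = OnlineClassifier(n_features=2)
    # Assumed features: [item_is_type_a, item_is_type_b].
    for _ in range(50):
        model.feedback([1.0, 0.0], 0)  # type A expired without ever being hit
        model.feedback([0.0, 1.0], 1)  # type B was hit while cached
    print(should_admit(model, [1.0, 0.0]), should_admit(model, [0.0, 1.0]))
```

Because each feedback call updates the weights in place, the next prediction already reflects it, which is the property the quoted passage attributes to the short development cycle.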
R&D
"Facebook is built using thousands of services, with functions ranging from balancing internet traffic to transcoding images to providing reliable storage. The efficiency of Facebook as a whole is the sum of the efficiencies of its individual services, and each service is typically optimized in its own way, with approaches that may be difficult to generalize or adapt in the face of fast-paced changes."
"The data sent to the server is sampled with a counter-bias to avoid percolating class imbalance biases in the samples. For example, if over a period of time we receive 1,000 times more negative examples than positive ones, we need only log 1 in 1,000 negative examples to the server, while also indicating that it has a weight of 1,000. The server’s visibility into the global distribution of the data usually leads to a better model than any individual node’s local model."