AI Case Study

MIT researchers propose an efficient and accurate system for protecting privacy in healthcare datasets

MIT researchers examine several ways their SplitNN approach can be used to train healthcare machine learning models while each institution keeps its raw patient data to itself. In their experiments, SplitNN reaches higher accuracy than federated learning and large batch synchronous SGD while requiring far less computation on the client side.



Healthcare Providers And Services

Project Overview

The authors investigate several configurations of their SplitNN system for distributed learning, in which a model is divided between parties so that raw data never has to be shared. They examine how each of the following three configurations affects model accuracy and the resources required:
a simple vanilla configuration for split learning,
a U-shaped configuration for split learning without label sharing,
and a vertically partitioned data configuration for split learning.
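
As a rough illustration of the vanilla configuration, the sketch below splits a tiny model at a cut layer: the client holds the raw features and the first layer, and the server holds the remaining layer and the labels. Only cut-layer activations and their gradients cross the boundary. The names (Client, Server, train_vanilla_split), the linear layers, the squared-error loss, and the synthetic data are all assumptions made for this example and are not taken from the paper.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class Client:
    """Hypothetical client: holds raw data and the layer up to the cut."""
    def __init__(self, n_in, n_cut, rng):
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                  for _ in range(n_cut)]
        self.x = None

    def forward(self, x):
        self.x = x                   # raw input never leaves the client
        return matvec(self.W, x)     # only cut-layer activations are sent

    def backward(self, grad_cut, lr):
        # Finish backpropagation locally with the gradient from the server.
        for i, g in enumerate(grad_cut):
            for j, xj in enumerate(self.x):
                self.W[i][j] -= lr * g * xj

class Server:
    """Hypothetical server: holds the rest of the model and the labels."""
    def __init__(self, n_cut, rng):
        self.v = [rng.uniform(-0.5, 0.5) for _ in range(n_cut)]

    def train_step(self, h, y, lr):
        err = sum(v * hi for v, hi in zip(self.v, h)) - y
        grad_cut = [err * v for v in self.v]   # d(loss)/d(activations)
        self.v = [v - lr * err * hi for v, hi in zip(self.v, h)]
        return 0.5 * err * err, grad_cut

def train_vanilla_split(epochs=200, lr=0.05, seed=0):
    """Train on synthetic data and return the mean loss per epoch."""
    rng = random.Random(seed)
    data = []
    for _ in range(20):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, x[0] - 2.0 * x[1]))    # made-up target function
    client, server = Client(2, 3, rng), Server(3, rng)
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            h = client.forward(x)                         # client -> server
            loss, grad_cut = server.train_step(h, y, lr)  # server side
            client.backward(grad_cut, lr)                 # server -> client
            total += loss
        history.append(total / len(data))
    return history
```

Calling train_vanilla_split() returns one mean loss per epoch, which should fall as training proceeds. In the U-shaped configuration the final layers and the labels would also stay on the client, so the server would see neither raw inputs nor labels.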

Reported Results

"In this distributed learning experiment we clearly see that SplitNN outperforms the techniques of federated learning and large batch synchronous SGD in terms of higher accuracies with drastically lower computational requirements on the side of clients. SplitNN is also scalable to large-scale settings and can use any state of the art deep learning architectures. In addition, the boundaries of resource efficiency can be pushed further in distributed deep learning by combining splitNN with neural network compression methods for seamless distributed learning with edge devices."


"As a concrete example we walkthrough the case where radiology centers collaborate with pathology test centers and a server for disease diagnosis. As shown in Fig. 2c radiology centers holding imaging data modalities train a partial model up to the cut layer. In the same way the pathology test center having patient test results trains a partial model up to its own cut layer. The outputs at the cut layer from both these centers are then concatenated and sent to the disease diagnosis server that trains the rest of the model. This process is continued back and forth to complete the forward and backward propagations in order to train the distributed deep learning model without sharing each others raw data."
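
The flow described in this quote can be sketched in a few lines of plain Python. This is a minimal toy under stated assumptions, not the authors' implementation: the function name train_vertical_split, the linear partial models, the squared-error loss, the feature counts, and the synthetic patient data are all hypothetical. Each center computes its own cut-layer output, the outputs are concatenated for the diagnosis server, and the server returns to each center only the slice of the gradient that belongs to it.

```python
import random

def linear(W, x):
    """Apply a linear layer (list of rows) to a feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def init(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def sgd_update(W, grad_out, x, lr):
    """Update a linear layer given the gradient at its output."""
    for i, g in enumerate(grad_out):
        for j, xj in enumerate(x):
            W[i][j] -= lr * g * xj

def train_vertical_split(epochs=200, lr=0.05, seed=0):
    """Return the mean loss per epoch for a two-center toy setup."""
    rng = random.Random(seed)
    # Synthetic patients: 3 imaging features held by the radiology center,
    # 2 test-result features held by the pathology center (made-up data).
    patients = []
    for _ in range(20):
        img = [rng.uniform(-1, 1) for _ in range(3)]
        lab = [rng.uniform(-1, 1) for _ in range(2)]
        patients.append((img, lab, img[0] - img[2] + 2.0 * lab[1]))

    W_rad = init(2, 3, rng)    # radiology partial model (up to its cut layer)
    W_path = init(2, 2, rng)   # pathology partial model (up to its cut layer)
    v = [rng.uniform(-0.5, 0.5) for _ in range(4)]  # diagnosis server layer

    history = []
    for _ in range(epochs):
        total = 0.0
        for img, lab, y in patients:
            # Forward pass: each center computes its cut-layer output locally.
            h_rad = linear(W_rad, img)
            h_path = linear(W_path, lab)
            h = h_rad + h_path             # list concatenation at the server
            err = sum(vi * hi for vi, hi in zip(v, h)) - y
            total += 0.5 * err * err
            # Backward pass: the server computes the cut-layer gradient,
            # updates its own layer, then sends each center its slice.
            grad_h = [err * vi for vi in v]
            v = [vi - lr * err * hi for vi, hi in zip(v, h)]
            sgd_update(W_rad, grad_h[:2], img, lr)
            sgd_update(W_path, grad_h[2:], lab, lr)
        history.append(total / len(patients))
    return history
```

Neither center ever sees the other's features, and the server sees only the concatenated cut-layer outputs. This sketch runs everything in one process for clarity; a real deployment would exchange the activations and gradients over a network.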


R And D

Core Research And Development


"Collaboration in health is heavily impeded by lack of trust, data sharing regulations such as HIPAA and limited consent of patients. In settings where different institutions hold different modalities of patient data in the form of electronic health records (EHR), picture archiving and communication systems (PACS) for radiology and other imaging data, pathology test results, or other sensitive data such as genetic markers for disease, collaborative training of distributed machine learning models without any data sharing is desired."



"the comparisons were done on the CIFAR 10 and CIFAR 100 datasets using VGG and Resnet-50 architectures for 100 and 500 client based setups respectively."