AI Case Study

Researchers at Macau University of Science and Technology develop a logistic regression model that combines active learning and semi-supervised learning to classify diseases with above 90% accuracy when labeled data is limited

Because labeled data is scarce, existing semi-supervised algorithms fall short in identifying disease-related genes and classifying diseases. To overcome this, researchers from Macau University of Science and Technology developed a logistic regression model that combines active learning and semi-supervised learning and achieves accuracy above 90%.

Industry

Public And Social Sector

Education And Academia

Project Overview

"Researchers developed a novel logistic regression model based on the complementarity of active learning (AL) and semi-supervised learning (SSL), which uses the unlabeled samples at the least cost to improve disease classification accuracy. In addition, a mechanism for updating pseudo-labeled samples is designed to reduce the number of false pseudo-labeled samples. The experimental results show that this new model achieves better performance than the widely used semi-supervised learning and active learning methods in disease classification and gene selection.

The workflow of our proposed logistic regression model is:

Step 1: The labeled data is used to learn an initial logistic regression model.

Step 2: The logistic regression model is used to label the unlabeled samples, and the high-value samples selected by SSL or AL are added to the training dataset.

Step 3: Update the logistic regression model using the new training dataset.

Step 4: Identify the false pseudo-labeled samples. If they are selected by SSL, return them to the unlabeled sample pool. Otherwise, change their labels and put them into the training dataset directly.

Step 5: The cycle continues until all the unlabeled samples have been labeled or the number of iterations exceeds the maximum."
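
The five steps can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the source does not say how SSL and AL pick "high value" samples or how false pseudo-labels are detected, so the sketch assumes binary labels, treats SSL picks as predictions with confidence at or above a threshold (`ssl_conf`), treats AL picks as the least confident samples in the pool (`al_batch` per round), and flags a pseudo-label as false when the updated model disagrees with it. All function and parameter names are hypothetical.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, epochs=500, l2=1e-2):
    """Fit logistic regression by gradient descent; the bias is the last weight."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * (Xb.T @ (p - y) / len(y) + l2 * w)
    return w

def predict_proba(w, X):
    """Probability of class 1 for each row of X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def train_with_pseudo_labels(X_lab, y_lab, X_unl,
                             max_iter=20, ssl_conf=0.9, al_batch=2):
    w = fit_logreg(X_lab, y_lab)        # Step 1: initial model from labeled data
    pool = list(range(len(X_unl)))      # indices of samples still unlabeled
    pseudo = {}                         # index -> (pseudo-label, "SSL" or "AL")
    for _ in range(max_iter):           # Step 5: stop at the iteration cap...
        if not pool:                    # ...or once the pool is empty
            break
        # Step 2: pseudo-label the pool and select samples (assumed criteria).
        p = predict_proba(w, X_unl[pool])
        conf = np.maximum(p, 1.0 - p)
        least_confident = set(np.argsort(conf)[:al_batch].tolist())
        for pos, idx in enumerate(pool):
            if conf[pos] >= ssl_conf:
                source = "SSL"          # confident prediction
            elif pos in least_confident:
                source = "AL"           # most informative (uncertain) sample
            else:
                continue
            pseudo[idx] = (int(p[pos] >= 0.5), source)
        pool = [i for i in pool if i not in pseudo]
        # Step 3: refit on the labeled data plus all pseudo-labeled samples.
        idx = list(pseudo)
        X_train = np.vstack([X_lab, X_unl[idx]])
        y_train = np.concatenate([y_lab, [pseudo[i][0] for i in idx]])
        w = fit_logreg(X_train, y_train)
        # Step 4: a pseudo-label is treated as false if the updated model
        # disagrees with it (assumed rule, not taken from the source).
        new_pred = (predict_proba(w, X_unl[idx]) >= 0.5).astype(int)
        for i, pred in zip(idx, new_pred):
            label, source = pseudo[i]
            if pred == label:
                continue
            if source == "SSL":
                del pseudo[i]           # return the sample to the pool
                pool.append(i)
            else:
                pseudo[i] = (int(pred), "AL")   # flip the label, keep the sample
    return w, pseudo
```

Calling `train_with_pseudo_labels(X_lab, y_lab, X_unl)` with NumPy arrays returns the final weight vector and a dictionary mapping each pseudo-labeled sample index to its label and the strategy that selected it. Returning false SSL picks to the pool while flipping false AL picks mirrors Step 4; the thresholds and batch size would need tuning on real data.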

Reported Results

* It works without any manual intervention, saving time and effort.

* It achieved accuracy above 90% in disease classification, outperforming standalone active learning (AL) and semi-supervised learning (SSL) models.

Technology

Function

Strategy

Data Science

Background

"Traditional supervised learning classifiers need many labeled samples to achieve good performance; however, in many biological datasets only a small number of samples are labeled and the remaining samples are unlabeled. Labeling these unlabeled samples manually is difficult or expensive."

Benefits

Data