AI Case Study

Researchers from Northwestern University achieved 88% accuracy predicting Amazon bestsellers with machine learning

Researchers from Northwestern University predict which books will become Amazon best sellers using data from the first month after publication, drawing on genre, author, Goodreads reviews and user characteristics, and achieve 88% accuracy. They compare several machine learning classification methods and find that features related to users and genres are better indicators than the reviews and ratings themselves.

Industry

Consumer Goods And Services

Media And Publishing

Project Overview

"We analyze a large dataset from Goodreads to understand various characteristic differences existing between the Amazon best selling books and the rest and make the following contributions."

Reported Results

"On a balanced set, we are able to achieve a very high average accuracy of 88.72% (85.66%) for the prediction where the other competitive class contains books which are randomly selected from the Goodreads dataset. Our method primarily based on features derived from user posts and genre related characteristic properties achieves an improvement of 16.4% over the traditional popularity factors (ratings, reviews) based baseline methods. We also evaluate our model with two more competitive set of books a) that are both highly rated and have received a large number of reviews (but are not best sellers) (HRHR) and b) Goodreads Choice Awards Nominated books which are non-best sellers (GCAN). We are able to achieve quite good results with very high average accuracy of 87.1% and as well a high ROC for ABS vs GCAN. For ABS vs HRHR, our model yields a high average accuracy of 86.22%."

Technology

"We compute all the feature values from the data available only within the time period t from the publication date. We ensure that all the books that we select in both the classes are published after 2007 since Goodreads was launched in 2007. The classifiers yield very similar classification performance with Logistic Regression performing little better; with logistic regression classifier, we obtain average accuracy of 88.72% with average precision and recall of 0.887 each and the average area under the ROC curve as 0.925 for t = 1 month on a balanced dataset with 10-fold cross-validation method. Note that the classification results for other time period also give very similar results. The user status and genre based features are most prominent ones and significantly outperforms the ratings and review feature based baselines."
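
The restriction to data available within the period t after publication can be illustrated with a minimal Python sketch. It is not the authors' code: the function name, the review record layout, the two summary features and the choice of 30 days to stand in for t = 1 month are all assumptions made for illustration. The paper's strongest features are user-status and genre based, which are not reproduced here.

```python
from datetime import date, timedelta


def features_within_window(publication_date, reviews, window_days=30):
    """Summarise only the reviews posted within `window_days` of publication.

    `reviews` is a list of dicts with hypothetical keys "posted" (a date)
    and "rating" (a number); the real Goodreads data has a richer schema.
    """
    cutoff = publication_date + timedelta(days=window_days)
    # Keep reviews from the publication date up to and including the cutoff.
    early = [r for r in reviews if publication_date <= r["posted"] <= cutoff]
    ratings = [r["rating"] for r in early]
    return {
        "num_reviews": len(early),
        "avg_rating": sum(ratings) / len(ratings) if ratings else 0.0,
    }


# Made-up example book with three reviews, one of them outside the window.
pub = date(2015, 3, 1)
reviews = [
    {"posted": date(2015, 3, 5), "rating": 4},
    {"posted": date(2015, 3, 20), "rating": 5},
    {"posted": date(2015, 6, 1), "rating": 2},  # posted after the cutoff, ignored
]
early_features = features_within_window(pub, reviews)
```

Computing every feature through a filter of this kind keeps later information, such as reviews written after a book has already become a best seller, out of the prediction.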

Function

Marketing

Marketing Research Planning

Background

"Analysis of reading habits has been an active area of research for quite long time. While most of these research investigate blog reading behavior, there have been some work that also discuss about interactive and connected book reading behavior. Despite such active research, very little investigation has been done so far to understand the characteristics of social book reading sites and how the collective reading phenomena can even influence the online sales of books. In this work, we attempt to bridge this gap and analyze the various factors related to book reading on a popular platform – Goodreads and apply this knowledge to distinguish Amazon best seller books from the rest."

Benefits

Data

"We obtain our Goodreads dataset through APIs and web-based crawls over a period of 9 months. This crawling exercise has resulted in the accumulation of a massive dataset spanning a period of around nine years. We first identify the unique genres from https://www.goodreads.com/genres/list. Note that genres in the Goodreads community are user defined. Next we collect unique books from the above list of genres and different information regarding these books are crawled via Goodreads APIs. Each book has information like the name of the author, the published year, number of ratings it received, average rating, number of reviews etc. In total, we could retrieve information of 558,563 books. We then find out the authors of these books and their information like number of distinct works, average rating, number of ratings, number of fans etc. In total, we have information of 332,253 authors. We separately collect the yearly Amazon best sellers from 1995 to 2016 and their ISBNs and then re-crawl Goodreads if relevant information about some of them is not already present in the crawled dataset. For these books, we separately crawl upto 2000 reviews and ratings in chronological order. We also crawl relevant shelves information of those books.

Since in real-life, the proportion of the Amazon best sellers is far lower than the other types of books, we also consider testing our model on an unbalanced test set. Here, the training and test sample sets are taken in 3:1 ratio. In training set, both the class samples are taken in equal proportion (to guarantee fair learning) whereas in test sample the Amazon best sellers and the other books are taken in 1:9 ratio. We then train our classifiers on the balanced training set and test on the unbalanced one."
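
The split described above can be sketched in a few lines of Python. This is one possible reading rather than the authors' procedure: it assumes the 3:1 ratio refers to the sizes of the training and test sets, and the function name, the seed and the placeholder book identifiers are hypothetical.

```python
import random


def make_splits(best_sellers, others, seed=0):
    """Build a balanced training set and a 1:9 test set, sized 3:1.

    With k best sellers in the test set, the test set holds 9k other books
    (10k items in total), so the training set holds 30k items, 15k per class.
    That requires 16k best sellers and 24k other books overall.
    """
    rng = random.Random(seed)
    pos = list(best_sellers)
    neg = list(others)
    rng.shuffle(pos)
    rng.shuffle(neg)

    k = min(len(pos) // 16, len(neg) // 24)
    if k == 0:
        raise ValueError("need at least 16 best sellers and 24 other books")

    # Label 1 marks an Amazon best seller, label 0 any other book.
    train = [(b, 1) for b in pos[:15 * k]] + [(b, 0) for b in neg[:15 * k]]
    test = [(b, 1) for b in pos[15 * k:16 * k]] + [(b, 0) for b in neg[15 * k:24 * k]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder identifiers standing in for real books.
train, test = make_splits(
    [f"abs_{i}" for i in range(32)],
    [f"other_{i}" for i in range(48)],
)
```

On a 1:9 test set a classifier that labels every book as a non-best seller already scores 90% accuracy, so results on the unbalanced set are best read alongside precision, recall or ROC figures rather than accuracy alone.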