AI Case Study

Researchers at Lancaster University break 33 text-based CAPTCHA schemes used by highly visited websites using a generative adversarial network

Researchers at Lancaster University, Northwest University and Peking University have trained a generative adversarial network (GAN) to break CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart). Unlike previous attempts at the same task, the system was not trained on a large dataset of real examples: it needed only 500 real CAPTCHAs, supplemented by synthetic ones. The researchers evaluated it on 33 text-based CAPTCHA schemes from popular websites. The model solved most sites' CAPTCHAs with a success rate above 80% and reached 100% on specific websites.

Industry

Technology

Software And IT Services

Project Overview

"Several attacks on CAPTCHAs have been proposed in the past, but none has been as accurate and fast as the machine learning algorithm presented by a group of researchers from Lancaster University, Northwest University and Peking University, described below.

One of the first known people to break CAPTCHAs was Adrian Rosebrock. In his book “Deep Learning for Computer Vision with Python” [4], he describes how he bypassed the CAPTCHA system on the E-ZPass New York website using machine learning: he downloaded a large image dataset of CAPTCHA examples and used it to train a deep learning model.

The main difference between Adrian’s solution and the one from the Lancaster, Northwest and Peking researchers is that the researchers did not need to download a large dataset of images to break the CAPTCHA systems. Instead, they used a generative adversarial network (GAN) to create synthesized CAPTCHAs and combined these with a small dataset of real CAPTCHAs to build an extremely fast and accurate CAPTCHA solver.

Generative adversarial networks, introduced by Ian Goodfellow and colleagues [2], are deep neural network architectures composed of two neural networks that compete against each other in a zero-sum game [3] in order to synthesize superficially authentic samples. They are especially useful in scenarios where the model does not have access to a large dataset.
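The zero-sum game can be made concrete with the standard GAN value function from Goodfellow et al., V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the discriminator D tries to maximize and the generator G tries to minimize. The sketch below is only an illustration of that objective on hand-picked discriminator outputs; the function names and the sample numbers are assumptions made for this example and are not taken from the researchers' code.

```python
import math

def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Zero-sum payoffs: whatever the discriminator gains, the generator loses.
def discriminator_payoff(d_real, d_fake):
    return gan_value(d_real, d_fake)

def generator_payoff(d_real, d_fake):
    return -gan_value(d_real, d_fake)

# Illustrative numbers only: a discriminator that tells real from fake well...
sharp = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# ...versus one that is completely fooled and outputs 0.5 for everything.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(f"sharp discriminator:  V = {sharp:.4f}")
print(f"fooled discriminator: V = {fooled:.4f}")
```

When the generator's samples are indistinguishable from real ones, the best the discriminator can do is output 0.5 everywhere, which gives V = -log 4; that is the point at which the synthesized samples look authentic.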

The researchers evaluated their approach by applying it to 33 text-based CAPTCHA schemes, 11 of which are currently used by 32 of the world’s most popular websites as ranked by Alexa, including schemes used by Google, Microsoft, eBay, Wikipedia, Baidu and many others. The machine learning model used to attack these CAPTCHA systems needed only 500 non-synthesized CAPTCHAs, instead of the millions of examples that earlier attacks (such as Adrian’s) required.

Once the model was initialized with the CAPTCHA security parameters shown in Figure 2, it generated a batch of synthetic CAPTCHAs, which were used together with the 500 real CAPTCHAs obtained from the CAPTCHA schemes shown in Figure 3 to train the synthesizer. The researchers used 20,000 CAPTCHAs to train the pre-processing model and 200,000 synthetic CAPTCHAs to train the base solver.
The machine learning prototype was implemented in Python. The pre-processing model is built on the Pix2Pix framework, which is implemented in TensorFlow, and the fine-tuned solver was coded in Keras.
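The general idea of training a base solver on plentiful synthetic data and then fine-tuning it on a handful of real samples can be illustrated with a toy model. The sketch below is not the researchers' Pix2Pix/Keras pipeline: it swaps in a one-feature logistic regression written in plain Python, and the data distributions, sample counts, learning rate and function names are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_samples(rng, n_per_class, mean0, mean1, std=0.3):
    """Return shuffled (feature, label) pairs for a toy two-class problem."""
    data = [(rng.gauss(mean0, std), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(mean1, std), 1) for _ in range(n_per_class)]
    rng.shuffle(data)
    return data

def train(params, data, epochs, lr=0.1):
    """Run plain SGD on the logistic loss, starting from `params`."""
    w, b = params
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def accuracy(params, data):
    w, b = params
    hits = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in data)
    return hits / len(data)

rng = random.Random(0)
# Synthetic data is plentiful but slightly "off" compared with real data.
synthetic = make_samples(rng, 1000, mean0=-1.0, mean1=1.0)
# Real data is scarce: only a few labelled samples are available for training.
real_train = make_samples(rng, 10, mean0=0.0, mean1=2.0)
real_test = make_samples(rng, 500, mean0=0.0, mean1=2.0)

base = train((0.0, 0.0), synthetic, epochs=5)   # stage 1: base solver
tuned = train(base, real_train, epochs=200)     # stage 2: fine-tuning

acc_base = accuracy(base, real_test)
acc_tuned = accuracy(tuned, real_test)
print(f"base solver on real data: {acc_base:.2f}")
print(f"fine-tuned on real data:  {acc_tuned:.2f}")
```

The fine-tuning stage starts from the weights learned on synthetic data rather than from scratch, which is why a small real set is enough to close the gap between the two distributions in this toy setting.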

After the generative adversarial networks were trained on the synthesized and real CAPTCHA samples, the CAPTCHA solver was used to solve CAPTCHAs from highly visited websites, such as Megaupload, Blizzard, Authorize.net, Captcha.net, Baidu, QQ, reCAPTCHA, Wikipedia, etc. Notably, most sites' CAPTCHAs were solved with a success rate above 80%, reaching 100% on sites like Blizzard, Megaupload and Authorize.net. The attack proved more accurate than all prior CAPTCHA-solving methods, which relied on large non-synthesized training datasets.
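A success rate of this kind is a fraction of test CAPTCHAs solved, so it is bounded by 100%. The sketch below shows one way to compute it, assuming that a CAPTCHA counts as solved only when the whole predicted string matches the label exactly; that scoring rule, the function name and the sample strings are assumptions for illustration, not details confirmed by the source.

```python
def success_rate(predictions, labels):
    """Fraction of CAPTCHAs whose predicted text matches the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one CAPTCHA is required")
    solved = sum(pred == truth for pred, truth in zip(predictions, labels))
    return solved / len(labels)

# Hypothetical test set: one prediction has a single wrong character,
# so that whole CAPTCHA counts as unsolved.
labels = ["ab12", "xy9z", "q7wp", "m3kd"]
predictions = ["ab12", "xy9z", "q7wq", "m3kd"]

rate = success_rate(predictions, labels)
print(f"success rate: {rate:.0%}")  # prints "success rate: 75%"
```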

Beyond enhanced accuracy, the researchers state in their paper that their approach is also more efficient and less expensive to implement than other proposed methods [1]. Besides being the first GAN-based solver for text-based CAPTCHAs, it opens the door for attackers, given how effective and inexpensive it is to implement.

Nevertheless, the approach has some limitations. One is CAPTCHAs with a variable number of characters: the current approach assumes a fixed number of characters, so the prototype would break if the length changed. Another is CAPTCHAs that use variable characters: the prototype can be trained to support this, but it does not do so as is.

It is important for highly visited websites to use more robust ways to protect their systems, such as bot-detection measures, cyber-security diagnoses and analytics, along with multiple layers of security based on device location, device type, browser, etc., since they are now an even easier target to attack."

Reported Results

Technology

Function

Background

"CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) were developed to prevent automated programs from misbehaving on the World Wide Web (filling out online forms, accessing restricted files, accessing a website an excessive number of times, etc.) by verifying that the end user is in fact “human” and not a bot."

Benefits

Data