How to break a simple captcha!!!
Recently I noticed a login page that uses a very simple captcha. As most of us know, CAPTCHA is a backronym for Completely Automated Public Turing test to tell Computers and Humans Apart. Web pages use this test to determine whether the user is a program or a human. A simple captcha can't do the job well, since you can write a program that predicts (let's say reads) the captcha and fills in the form automatically. So captcha images have to be hard for a machine to read.
There are various types of captcha, and the simplest is an image that consists of some digits and letters. But not all CAPTCHAs have to be images. In fact, any hard artificial intelligence problem, such as speech recognition, could be used as a CAPTCHA. If you want to know more about CAPTCHAs and their history, you can read more on Wikipedia.
Now let's try to break an easy captcha using a machine learning algorithm. The image below shows captchas from a login page. As you can see, all the captchas in the image have a green background and random black lines as noise. Note that the characters are always white. Strictly speaking, this is not even noise, since we define noise as an unwanted signal or data that has a random nature. (The formal definition: noise can refer to any random fluctuations of data that hinder perception of an expected signal.) The noise in this captcha has a pattern!!! So by observing the pattern, we can remove it and end up with a simple character classification problem.
So this is what we are going to do:
- Convert the RGB image to a grayscale image.
- Convert the grayscale image to a binary one using a threshold (here we choose Th=200).
- Split the image into four parts, each containing only one character.
- Label all the images for training.
- Train a machine learning algorithm to recognize the characters (I have used a neural network).
- Test the trained model on real captchas.
The first step is to convert the color image to grayscale. This is the Python code to do the job.

    import numpy as np

    # Convert an RGB image to grayscale.
    # The input must be a numpy array of shape MxNx3.
    # The return value is an MxN matrix.
    def rgb2grayscale(img):
        if img.ndim != 3:
            print("Error: This is not a 3d matrix")
            return None
        (row, col, dim) = img.shape
        if dim != 3:
            print("error: the input matrix dimension is not 3")
            return None
        # Weighted sum of the red, green and blue channels
        rprim = 0.2989 * img[:, :, 0]
        gprim = 0.5870 * img[:, :, 1]
        bprim = 0.1140 * img[:, :, 2]
        return rprim + gprim + bprim
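The second step is to convert the grayscale image to a binary one using the Python code below:

    # Convert a grayscale image to a binary image.
    # Input parameters:
    #   img is an MxN matrix
    #   th is the threshold (here th is 200)
    def grayscale2binary(img, th):
        # Pixels above the threshold become 1, everything else becomes 0.
        # A single comparison also handles pixels exactly equal to th
        # and leaves the input array untouched.
        return (img > th).astype('uint8')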
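To see why a threshold of 200 is enough here, it helps to push a few representative colors through the two steps. The sketch below is a small self-contained check. It assumes a pure green background (0, 255, 0); the exact shade used by the real captcha isn't given, so treat the numbers as illustrative. The helper name `luminance` is made up for this example.

```python
import numpy as np

# Same channel weights as in rgb2grayscale.
WEIGHTS = np.array([0.2989, 0.5870, 0.1140])

def luminance(rgb):
    """Grayscale value of a single RGB pixel (hypothetical helper)."""
    return float(np.dot(WEIGHTS, rgb))

# Assumed colors: pure green background, black lines, white characters.
colors = {
    "background": (0, 255, 0),
    "line": (0, 0, 0),
    "character": (255, 255, 255),
}

TH = 200
binary = {name: int(luminance(rgb) > TH) for name, rgb in colors.items()}
for name, rgb in colors.items():
    print(name, round(luminance(rgb), 2), binary[name])
```

Under these assumptions only the white character pixels end up above the threshold, so the background and the lines both disappear in the binary image.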
And the result would be something like the image below:
The next step is a little tricky. We know that every captcha has exactly four characters and that the characters do not overlap. So this is the algorithm to split the image into four sub-images, one per character.
- 1. Read each binary image into memory.
- 2. Set index1 to zero (index1 = 0).
- 3. Start from column index1 of the image.
- 3.1 Move right until you find a column that has at least one white pixel; this is our first index (index1).
- 3.2 Keep moving right until you find a column in which all the pixels are black (zero value); this is our second index (index2).
- 3.3 Cut out the sub-image between index1 and index2.
- 3.4 Set index1 = index2.
- 4. Go to step 3 (four times in total, since each image has four characters).
Here is the code for the above algorithm. From observation, we already know that all the sub-images must be 25x10, so after splitting we have to pad any smaller image to 25x10.

    import numpy as np

    # img is the binary captcha image (25 rows, values 0 or 1)
    result = list()
    n_cols = img.shape[1]
    end_index = -1
    for i in range(4):  # 4 characters in each captcha
        # skip black columns until the character starts
        start_index = end_index + 1
        while start_index < n_cols and sum(img[:, start_index]) == 0:
            start_index += 1
        # move on until the first all-black column after the character
        end_index = start_index
        while end_index < n_cols and sum(img[:, end_index]) != 0:
            end_index += 1
        # now we have the character image, but we have to pad it to the
        # standard size, in this case (25,10); only the columns need checking
        new_char_image = img[:, max(start_index - 1, 0):end_index + 1]
        (row, col) = new_char_image.shape
        if col < 10:
            temp = np.zeros((25, 10))
            temp[:, :col] = new_char_image
            result.append(temp)
        elif col == 10:
            result.append(new_char_image)
        else:
            print('oops!!!!...this is odd')
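The same column scan can also be written without explicit while loops. The sketch below is an alternative vectorized variant, not the code from the package: it finds every run of non-empty columns with `np.diff` and, unlike the loop version, does not keep the one-column margin around each character. The function name `split_by_columns` and the synthetic test image are made up for this example.

```python
import numpy as np

def split_by_columns(img, width=10):
    """Split a binary image into characters at all-black columns.

    Returns one (rows, width) array per run of non-empty columns.
    """
    rows, cols = img.shape
    has_ink = img.sum(axis=0) > 0          # True where a column has a white pixel
    # Pad with False on both sides so every run has a start and an end.
    padded = np.concatenate(([False], has_ink, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)    # first column of each run
    ends = np.flatnonzero(edges == -1)     # one past the last column of each run
    chars = []
    for s, e in zip(starts, ends):
        if e - s > width:
            raise ValueError("character wider than %d columns" % width)
        out = np.zeros((rows, width), dtype=img.dtype)
        out[:, :e - s] = img[:, s:e]       # left-align, zero-pad on the right
        chars.append(out)
    return chars

# Tiny synthetic image: two "characters" that are 2 and 3 columns wide.
demo = np.zeros((25, 12), dtype=np.uint8)
demo[5:20, 1:3] = 1
demo[5:20, 6:9] = 1
parts = split_by_columns(demo)
print(len(parts), parts[0].shape)
```

On a real captcha you would expect exactly four parts; any other count is a hint that two characters touch or that a noise line survived the threshold.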
And the results are as below.
Now this is a matter of classification. All we have to do is train a machine learning algorithm and use it to break some captchas. I have provided a full package that includes all the Python code as well as a training dataset and its labels. You can download it from here.
There are four files in the package that need some explanation:
This file (train_image.train) contains 3104 images of 25x10 pixels each. Each byte is one pixel, a grayscale value between 0 and 255.
    [offset]  [type]      [value]  [description]
    0000      8bit uchar  0~255    pixel(0,0)
    0001      8bit uchar  0~255    pixel(0,1)
    0002      8bit uchar  0~255    pixel(0,2)
    0003      8bit uchar  0~255    pixel(0,3)
    ....      ....        ...      ...
    0010      8bit uchar  0~255    pixel(1,0)
    0011      8bit uchar  0~255    pixel(1,1)
    ....      .....       ...      ...
    xxxx      8bit uchar  ??       pixel(n,m)
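Given this layout, the image file can be read in one call. The sketch below assumes the file has no header and stores the pixels row by row, which is what the offset table suggests but is not stated explicitly. The function name `load_images` and the demo file name are hypothetical; the demo writes a small synthetic file so the example runs without the real package.

```python
import os
import tempfile
import numpy as np

ROWS, COLS = 25, 10  # image size stated in the text

def load_images(path):
    """Read a raw uint8 file into an array of shape (n_images, 25, 10)."""
    raw = np.fromfile(path, dtype=np.uint8)
    if raw.size % (ROWS * COLS) != 0:
        raise ValueError("file size is not a multiple of 250 bytes")
    return raw.reshape(-1, ROWS, COLS)

# Demo with a synthetic file standing in for train_image.train.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_images.bin")  # hypothetical file name
    fake = np.arange(2 * ROWS * COLS) % 256           # two fake images
    fake.astype(np.uint8).tofile(demo_path)
    images = load_images(demo_path)

print(images.shape)
```

With the real file, `load_images("train_image.train")` should give an array of shape (3104, 25, 10) if the assumptions above hold.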
This file contains 3104 bytes, one for each image in the train_image.train file. Each byte is the ASCII code of the corresponding image's label.
    [offset]  [byte value (hex)]  [label]
    0000      39                  9
    0001      7a                  z
    0002      64                  d
    0003      64                  d
    0004      37                  7
    ....      ..                  .
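Decoding the labels is just a byte-to-character conversion. The sketch below uses the five bytes listed in the table instead of the real label file, and the function name `decode_labels` is made up for this example.

```python
def decode_labels(raw):
    """Turn a bytes object into a list of single-character labels."""
    return [chr(b) for b in raw]

# The five bytes shown in the table (hex 39 7a 64 64 37).
sample = bytes([0x39, 0x7A, 0x64, 0x64, 0x37])
labels = decode_labels(sample)
print(labels)  # ['9', 'z', 'd', 'd', '7']
```

For the real file you would pass the result of `open(path, "rb").read()`, where `path` points to the label file in the package.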
This is a numpy array file containing the weights between the input and hidden layers of our network.
This is a numpy array file containing the weights between the hidden and output layers of our network.
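To give an idea of how two such weight matrices are used, here is a minimal forward-pass sketch. Everything about the architecture is an assumption: the text only says there is one set of weights from input to hidden and one from hidden to output, so the sigmoid activation, the missing bias terms, the hidden layer size of 30, and the order of the classes are all placeholders. Random matrices stand in for the real weight files.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(image, w_hidden, w_output, classes):
    """Forward pass of a one-hidden-layer network (assumed architecture)."""
    x = image.reshape(-1).astype(np.float64)   # 25x10 image -> 250-vector
    hidden = sigmoid(x @ w_hidden)             # assumed: sigmoid, no bias
    scores = hidden @ w_output
    return classes[int(np.argmax(scores))]

# Random stand-ins for the two weight files; with the real package you
# would load the arrays with np.load(...) instead.
rng = np.random.default_rng(0)
classes = "0123456789abcdefghijklmnopqrstuvwxyz"   # assumed label order
w_hidden = rng.normal(size=(250, 30))              # assumed hidden size
w_output = rng.normal(size=(30, len(classes)))
guess = predict(np.zeros((25, 10)), w_hidden, w_output, classes)
print(guess)
```

With random weights the guess is meaningless; the point is only the shape of the computation, from a 250-pixel input to one score per character.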
I have tested the algorithm and it has more than 95% accuracy. If you have any questions, do not hesitate to ask.