How to break a simple captcha

Recently, I noticed a login page that uses a very simple captcha. As you probably know, CAPTCHA is a backronym for Completely Automated Public Turing test to tell Computers and Humans Apart. Web pages use this test to determine whether the user is a human or a computer program. A simple captcha can't do that job well, since you can write a program to read the captcha and fill in the form automatically. So captcha images have to be hard for a machine to read.

There are various types of captcha, and the simplest one is an image containing some digits and letters. But not all CAPTCHAs have to be images. In fact, any hard artificial intelligence problem, such as speech recognition, could be used as a CAPTCHA. If you want to know more about CAPTCHAs and their history, you can read more on Wikipedia.

Now let's try to break an easy captcha using a machine learning algorithm. The image below shows the captcha from a login web page. As you can see, all the captchas in the image have a green background with random black lines as noise, and the characters are always white. Strictly speaking, this is not even noise, since we define noise as an unwanted signal or data with a random nature. (More formally, noise can refer to any random fluctuations of data that hinder perception of an expected signal.) The "noise" in this captcha has a pattern: the lines are always black and the background is always green, while the characters are always white. So by exploiting that pattern, we can remove it and reduce the task to a simple character classification problem.

So this is what we are going to do:

The first step is to convert the color image to grayscale. This is the Python code that does the job.


	import numpy as np

	# Convert an RGB image to grayscale.
	# The input must be a NumPy array of shape MxNx3.
	# Returns an MxN float array, or None if the input has the wrong shape.
	def rgb2grayscale(img):
		if img.ndim != 3:
			print("Error: this is not a 3-D array")
			return None
		(row, col, dim) = img.shape
		if dim != 3:
			print("Error: the third dimension of the input must have size 3")
			return None
		# weighted sum of the red, green and blue channels
		rprim = 0.2989 * img[:, :, 0]
		gprim = 0.5870 * img[:, :, 1]
		bprim = 0.1140 * img[:, :, 2]
		return rprim + gprim + bprim
		

The second step is to convert the grayscale image to a binary one, using the Python code below:


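	# Convert a grayscale image to a binary image.
	# Input parameters:
	#   img is an MxN NumPy array
	#   th is the threshold (here th is 200)
	# Returns a new MxN array: 1 where img >= th, 0 elsewhere
	# (pixels exactly equal to th are no longer left unchanged).
	def grayscale2binary(img, th):
		return (img >= th).astype('float32')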
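
To see why a threshold of 200 is enough to strip both the background and the lines, here is a small worked example. It is only a sketch under assumed, idealized colors: pure green (0, 255, 0) for the background, pure black for the lines and pure white for the characters. The real captcha may use slightly different shades, and the 3x3 image is made up for illustration.

```python
import numpy as np

# Luma weights, the same ones used in rgb2grayscale
WEIGHTS = np.array([0.2989, 0.5870, 0.1140])

# Assumed, idealized colors (the real captcha may use other shades)
GREEN = (0, 255, 0)      # background
BLACK = (0, 0, 0)        # line "noise"
WHITE = (255, 255, 255)  # characters

# Made-up 3x3 RGB image: a black diagonal line on a green background,
# plus two white "character" pixels in the last column
toy = np.array([
    [BLACK, GREEN, WHITE],
    [GREEN, BLACK, WHITE],
    [GREEN, GREEN, BLACK],
], dtype=np.uint8)

# Grayscale: weighted sum over the color axis, result has shape (3, 3)
gray = toy.astype(float) @ WEIGHTS

# Binary: green (about 149.7) and black (0) fall below 200,
# white (about 255) stays above it
binary = (gray >= 200).astype(np.uint8)

print(np.round(gray, 1))
print(binary)  # only the two white pixels are set to 1
```

With these assumed colors, green maps to roughly 150 and white to roughly 255, so any threshold between the two keeps the characters and drops everything else.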
		

And the result would be something like the image below:

The next step is a little tricky. We know that every captcha has exactly four characters and that the characters do not overlap. So this is the algorithm to split the image into four subimages, one per character.

Here is the code for the algorithm above. From observation, we already know that every subimage must be 25x10, so after splitting we have to pad any narrower subimage to 25x10.


	import numpy as np

	# img is the 25-row binary image produced by grayscale2binary
	result = []
	n_cols = img.shape[1]
	end_index = -1
	for i in range(4):  # 4 characters in each captcha
		# skip the empty (all-black) columns in front of the character
		start_index = end_index + 1
		while start_index < n_cols and img[:, start_index].sum() == 0:
			start_index += 1
		# advance to the first empty column after the character
		end_index = start_index
		while end_index < n_cols and img[:, end_index].sum() != 0:
			end_index += 1
		# keep one column of margin on each side, clamped to the image
		left = max(start_index - 1, 0)
		right = min(end_index + 1, n_cols)
		new_char_image = img[:, left:right]
		(row, col) = new_char_image.shape
		# now we have the character image, but we have to pad it on the
		# right to the standard size, which in this case is (25, 10);
		# only the columns need to be checked
		if col < 10:
			temp = np.zeros((25, 10))
			temp[:, :col] = new_char_image
			result.append(temp)
		elif col == 10:
			result.append(new_char_image)
		else:
			print('oops... this character is wider than 10 columns')
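
As an alternative to the column-by-column loops, the same boundaries can be found in a vectorized way by looking at where the column sums switch between empty and non-empty. This is a sketch, not the code from the package: the name `split_characters` and its `height` and `width` parameters are made up for illustration, and unlike the loop above it does not keep a one-column margin around each character.

```python
import numpy as np

def split_characters(img, height=25, width=10):
    """Split a binary image into fixed-size character images.

    Hypothetical helper: img must have `height` rows, and every
    character must be at most `width` columns wide.
    """
    # True for every column that contains at least one white pixel
    has_ink = img.sum(axis=0) > 0
    # Pad with False on both sides so every run of ink columns has
    # both a rising and a falling edge
    padded = np.concatenate(([False], has_ink, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # first column of each run
    ends = np.flatnonzero(edges == -1)    # one past the last column

    chars = []
    for s, e in zip(starts, ends):
        if e - s > width:
            raise ValueError("character wider than %d columns" % width)
        out = np.zeros((height, width))
        out[:, :e - s] = img[:, s:e]      # left-align, pad on the right
        chars.append(out)
    return chars

# Made-up test image: four white blocks of different widths
demo = np.zeros((25, 40))
for start, w in [(2, 5), (11, 7), (21, 4), (30, 8)]:
    demo[5:20, start:start + w] = 1

chars = split_characters(demo)
print(len(chars), chars[0].shape)
```

Because it finds every run of non-empty columns, this version also makes it easy to check that exactly four characters were found before going on.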
		

And the results are shown below.

Now this is a matter of classification. All we have to do is train a machine learning algorithm and use it to break some captchas. I have provided a full package that includes all the Python code, along with a training dataset and its labels. You can download it from here.
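
To give an idea of what the classification step can look like, here is a minimal sketch using a 1-nearest-neighbor classifier on the flattened 25x10 images. This is an assumption for illustration only: the package may well use a different algorithm, the `train` and `predict` helpers are hypothetical names, and the two "characters" are toy shapes rather than real captcha data.

```python
import numpy as np

def train(images, labels):
    """For 1-nearest-neighbor, training just stores the flattened examples."""
    X = np.stack([im.ravel() for im in images]).astype(float)
    return X, list(labels)

def predict(model, image):
    """Return the label of the stored example closest to the image."""
    X, y = model
    # squared Euclidean distance to every stored example
    d = ((X - image.ravel()) ** 2).sum(axis=1)
    return y[int(np.argmin(d))]

# Toy training set: a vertical bar labeled "1" and a horizontal bar
# labeled "-" (made-up shapes, not real captcha characters)
vertical = np.zeros((25, 10))
vertical[:, 4:6] = 1
horizontal = np.zeros((25, 10))
horizontal[11:14, :] = 1
templates = {"1": vertical, "-": horizontal}

model = train(list(templates.values()), list(templates.keys()))

# Query: the horizontal bar with three extra white pixels
query = horizontal.copy()
query[0, :3] = 1

print(predict(model, query))
```

With a real training set, `images` would be the 25x10 subimages produced by the splitting step and `labels` the characters they show.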

There are four files in the package that need some explanation:

I have tested the algorithm and it reaches more than 95% accuracy. If you have any questions, do not hesitate to ask.