Showing posts with label scipy. Show all posts

Sunday, 24 September 2017

Spam Detection using Machine Learning in Python Part 3 - Training and Testing

Welcome back to Part 3 of the tutorial. In this part we will be creating our feature set and then training and testing our models. If you have not read the previous part, look here. Part 2 is really important.

Let's jump straight into today's part.

Step 4:- Creating the feature data set

In the previous part we saw that the top 2000 words are our features. But the features alone won't be enough for us. Every feature needs to have some value for a particular sentence. Since our features are the top 2000 words in the bag of words, each feature can have one of 2 values. They are:-
  • True - If the feature is present in the sentence.
  • False - If the feature is not present in the sentence.
To do this specific task we create a separate function called find_feature that takes the feature list word_features and a message as input. The variable called feature is a dictionary in which each feature word is a key, and the presence (True) or absence (False) of that word in the particular sentence is the value of the key.

def find_feature(word_features, message):
    # find features of a message
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature

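To see what find_feature returns, here is a small sketch that calls it with a hand-made word list instead of the real top 2000 words. The toy_features list and the sample message are made up for illustration. Note that `word in message.lower()` is a substring test, so a short feature such as "win" also matches inside "winner".

```python
def find_feature(word_features, message):
    # same function as above: map every feature word to True/False
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature

# hypothetical, hand-picked features (the tutorial uses the top 2000 words)
toy_features = ["free", "win", "hello"]
result = find_feature(toy_features, "FREE entry for every WINNER")
print(result)  # {'free': True, 'win': True, 'hello': False}
```

If matching only whole words matters for your data, one option is to test against `message.lower().split()` instead, at the cost of missing words that are glued to punctuation.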
Let us call this function repeatedly for every message in the all_messages variable to create our feature set.

random.shuffle(all_messages)
random.shuffle(all_messages)
random.shuffle(all_messages)
print("\nCreating feature set....")
featureset = [(find_feature(word_features, message), category) for (message, category) in all_messages]
print("Feature set created.")
trainingset = featureset[:int(len(featureset)*3/4)]
testingset = featureset[int(len(featureset)*3/4):]
print("\nLength of feature set ", len(featureset))
print("Length of training set ", len(trainingset))
print("Length of testing set ", len(testingset))

What I did here is take all_messages and give it a good shuffle to remove any bias from the order of the messages. Then I called the find_feature function for every message in all_messages. Finally I split the featureset variable into 2 parts. The first 3/4th is used to train our models and the remaining 1/4th is used to test our models. So what does our feature set look like? Think of it as a table with one row per message, one column per feature and a final column for the result.
Here S1, S2 ... Sn are the messages or sentences (the rows).
A1, A2, A3, .... A2000 are the features (the columns).
Result is the classification of the sentence or message S, which is stored in all_messages.
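The layout and the 3/4th split can be tried out on a tiny example. The four entries below are invented stand-ins for real messages, with only three features instead of 2000.

```python
# toy stand-in for featureset: (feature dictionary, result) pairs, made up for illustration
featureset = [
    ({"free": True, "win": True, "hello": False}, "spam"),
    ({"free": False, "win": False, "hello": True}, "ham"),
    ({"free": True, "win": False, "hello": False}, "spam"),
    ({"free": False, "win": False, "hello": False}, "ham"),
]

split = int(len(featureset) * 3 / 4)  # index where the testing part starts
trainingset = featureset[:split]      # first 3/4th
testingset = featureset[split:]       # remaining 1/4th

print(len(trainingset), len(testingset))  # 3 1
```

Each row is one message S, each key of the dictionary is one feature A, and the second element of the pair is the Result column.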

Step 5:- Training and Testing

Now that we have our training and testing set we can now train our models. But presently our program looks something like this-

import nltk
import random
from nltk.corpus import stopwords
import string

def find_feature(word_features, message):
    # find features of a message
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature

with open('SMSSpamCollection') as f:
    messages = [line for line in f.read().split('\n') if line]  # skip blank lines

print("Creating bag of words....")
all_messages = []  # stores all the messages along with their classification
all_words = []  # bag of words
for message in messages:
    if message.split('\t')[0] == "spam":
        all_messages.append([message.split('\t')[1], "spam"])
    else:
        all_messages.append([message.split('\t')[1], "ham"])

    for s in string.punctuation:  # Remove punctuations
        if s in message:
            message = message.replace(s, " ")

    stop = stopwords.words('english')
    for word in message.split(" "):  # Remove stopwords
        if not word in stop:
            all_words.append(word.lower())
print("Bag of words created.")

random.shuffle(all_messages)
random.shuffle(all_messages)
random.shuffle(all_messages)

all_words = nltk.FreqDist(all_words)
# top 2000 words are our features (most_common sorts by frequency)
word_features = [word for (word, count) in all_words.most_common(2000)]

print("\nCreating feature set....")
featureset = [(find_feature(word_features, message), category) for (message, category) in all_messages]
print("Feature set created.")
trainingset = featureset[:int(len(featureset)*3/4)]
testingset = featureset[int(len(featureset)*3/4):]
print("\nLength of feature set ", len(featureset))
print("Length of training set ", len(trainingset))
print("Length of testing set ", len(testingset))

Now that looks dirty. Damn ugly. Let us put it into a function for better readability

import nltk
import random
from nltk.corpus import stopwords
import string

def find_feature(word_features, message):
    # find features of a message
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature

def create_training_testing():
    with open('SMSSpamCollection') as f:
        messages = [line for line in f.read().split('\n') if line]  # skip blank lines

    print("Creating bag of words....")
    all_messages = []  # stores all the messages along with their classification
    all_words = []  # bag of words
    for message in messages:
        if message.split('\t')[0] == "spam":
            all_messages.append([message.split('\t')[1], "spam"])
        else:
            all_messages.append([message.split('\t')[1], "ham"])

        for s in string.punctuation:  # Remove punctuations
            if s in message:
                message = message.replace(s, " ")

        stop = stopwords.words('english')
        for word in message.split(" "):  # Remove stopwords
            if not word in stop:
                all_words.append(word.lower())
    print("Bag of words created.")

    random.shuffle(all_messages)
    random.shuffle(all_messages)
    random.shuffle(all_messages)

    all_words = nltk.FreqDist(all_words)
    # top 2000 words are our features (most_common sorts by frequency)
    word_features = [word for (word, count) in all_words.most_common(2000)]

    print("\nCreating feature set....")
    featureset = [(find_feature(word_features, message), category) for (message, category) in all_messages]
    print("Feature set created.")
    trainingset = featureset[:int(len(featureset)*3/4)]
    testingset = featureset[int(len(featureset)*3/4):]
    print("\nLength of feature set ", len(featureset))
    print("Length of training set ", len(trainingset))
    print("Length of testing set ", len(testingset))

    return word_features, featureset, trainingset, testingset

With that out of the way we can now create our models. In this project we will be using 5 different algorithms to train 5 different models. The algorithms are-
  • Naive Bayes
  • Multinomial Naive Bayes
  • Bernoulli Naive Bayes
  • Stochastic Gradient Descent
  • Logistic Regression

Oh no!!! Algorithms.... Maths..... I think I am done with this tutorial...... 

Do not worry about the algorithms. You do not have to write them on your own from scratch. NLTK provides the plain Naive Bayes classifier, and Scikit-Learn provides a large number of algorithms for data science and data mining, including the other four used here. So it is not necessary for you to know how the algorithms work internally in order to use them. But having some knowledge of them is definitely helpful.
Enough said... Let us train the five models using these algorithms and check their accuracy against the testing set.
def create_mnb_classifier(trainingset, testingset):
    # Multinomial Naive Bayes Classifier
    print("\nMultinomial Naive Bayes classifier is being trained and created...")
    MNB_classifier = SklearnClassifier(MultinomialNB())
    MNB_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(MNB_classifier, testingset)*100
    print("MultinomialNB Classifier accuracy = " + str(accuracy))
    return MNB_classifier

def create_bnb_classifier(trainingset, testingset):
    # Bernoulli Naive Bayes Classifier
    print("\nBernoulli Naive Bayes classifier is being trained and created...")
    BNB_classifier = SklearnClassifier(BernoulliNB())
    BNB_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(BNB_classifier, testingset)*100
    print("BernoulliNB accuracy percent = " + str(accuracy))
    return BNB_classifier

def create_logistic_regression_classifier(trainingset, testingset):
    # Logistic Regression classifier
    print("\nLogistic Regression classifier is being trained and created...")
    LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
    LogisticRegression_classifier.train(trainingset)
    print("Logistic Regression classifier accuracy = "+ str((nltk.classify.accuracy(LogisticRegression_classifier, testingset))*100))
    return LogisticRegression_classifier

def create_sgd_classifier(trainingset, testingset):
    print("\nSGD classifier is being trained and created...")
    SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
    SGDClassifier_classifier.train(trainingset)
    print("SGD Classifier classifier accuracy = " + str((nltk.classify.accuracy(SGDClassifier_classifier, testingset))*100))
    return SGDClassifier_classifier

def create_nb_classifier(trainingset, testingset):
    # Naive Bayes Classifier
    print("\nNaive Bayes classifier is being trained and created...")
    NB_classifier = nltk.NaiveBayesClassifier.train(trainingset)
    accuracy = nltk.classify.accuracy(NB_classifier, testingset)*100
    print("Naive Bayes Classifier accuracy = " + str(accuracy))
    NB_classifier.show_most_informative_features(20)
    return NB_classifier

See I told you it is that easy.
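The accuracy printed by each function is the share of messages in the testing set whose predicted label matches the stored label, multiplied by 100. The sketch below computes the same kind of number in plain Python. KeywordClassifier is a made-up stand-in with a classify method, not one of the five real models, and the tiny testing set is invented.

```python
class KeywordClassifier:
    """Hypothetical stand-in: calls a message spam if its 'free' feature is True."""
    def classify(self, feature):
        return "spam" if feature.get("free") else "ham"

def accuracy_percent(classifier, testingset):
    # count how many predictions agree with the known labels
    correct = sum(1 for feature, label in testingset
                  if classifier.classify(feature) == label)
    return correct / len(testingset) * 100

toy_testingset = [
    ({"free": True}, "spam"),
    ({"free": False}, "ham"),
    ({"free": False}, "spam"),  # this one is misclassified by the stand-in
    ({"free": True}, "spam"),
]
print(accuracy_percent(KeywordClassifier(), toy_testingset))  # 75.0
```

Because the accuracy is measured on messages the model did not see during training, it gives a rough idea of how the model behaves on new mail.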

Now let us create and call a main function that integrates and calls the above modules systematically. To do that-
def main():
    """
    this function is used to show how to use this program.
    the models can be pickled if wanted or needed.
    i have used 4 mails to check if my models are working correctly.
    """
    word_features, featureset, trainingset, testingset = create_training_testing()
    NB_classifier = create_nb_classifier(trainingset, testingset)
    MNB_classifier = create_mnb_classifier(trainingset, testingset)
    BNB_classifier = create_bnb_classifier(trainingset, testingset)
    LR_classifier = create_logistic_regression_classifier(trainingset, testingset)
    SGD_classifier = create_sgd_classifier(trainingset, testingset)

    mails = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
             "Hello Ward, It has been almost 3 months since i have written you. Hope you are well.",
             "FREE FREE FREE Get a chance to win 10000 $ for free. Also get a chance to win a car and a house",
             "Hello my friend, How are you? It is has been 3 months since we talked. Hope you are well. Can we meet at my place?"]

    print("\n")
    print("Naive Bayes")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(NB_classifier.classify(feature))

    print("\n")
    print("Multinomial Naive Bayes")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(MNB_classifier.classify(feature))

    print("\n")
    print("Bernoulli Naive Bayes")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(BNB_classifier.classify(feature))

    print("\n")
    print("Logistic Regression")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(LR_classifier.classify(feature))

    print("\n")
    print("Stochastic Gradient Descent")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(SGD_classifier.classify(feature))

main()

What I am doing here is taking 4 mails/messages and checking which of them are spam and which of them are ham. To do that-

  1. Take one mail at a time.
  2. Find the feature set for it.
  3. Use the feature set with the different classifiers to see what the message is, i.e. spam or ham.
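The three steps are the same for every classifier, so they can also be written once as a loop over a dictionary of classifiers. This is only a sketch of that idea: KeywordClassifier, classify_mails, the three features and the two mails are all invented for the example. Any object with a classify method, such as the five trained models, could be put in the dictionary instead.

```python
def find_feature(word_features, message):
    # same helper as in the tutorial, written as a dictionary comprehension
    return {word: word in message.lower() for word in word_features}

class KeywordClassifier:
    """Hypothetical stand-in for a trained model; it only looks at one feature."""
    def __init__(self, keyword):
        self.keyword = keyword

    def classify(self, feature):
        return "spam" if feature.get(self.keyword) else "ham"

def classify_mails(classifiers, word_features, mails):
    # steps 1-3: take one mail at a time, build its features, ask each classifier
    results = {}
    for name, classifier in classifiers.items():
        results[name] = [classifier.classify(find_feature(word_features, mail))
                         for mail in mails]
    return results

word_features = ["free", "win", "hello"]  # made-up features
classifiers = {
    "Looks for 'free'": KeywordClassifier("free"),
    "Looks for 'win'": KeywordClassifier("win"),
}
mails = ["FREE FREE FREE Get a chance to win", "Hello my friend, How are you?"]

results = classify_mails(classifiers, word_features, mails)
for name, labels in results.items():
    print(name)
    print("-----------")
    for label in labels:
        print(label)
```

With this shape, adding a sixth model means adding one entry to the dictionary rather than another copy of the loop.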
So the whole program looks something like this now-

import nltk
import random
import os
from nltk.corpus import stopwords
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
import string
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

# For clearing the screen
if os.name == 'nt':
    clear_screen = "cls"
else:
    clear_screen = "clear"
os.system(clear_screen)

def find_feature(word_features, message):
    # find features of a message
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature

def create_mnb_classifier(trainingset, testingset):
    # Multinomial Naive Bayes Classifier
    print("\nMultinomial Naive Bayes classifier is being trained and created...")
    MNB_classifier = SklearnClassifier(MultinomialNB())
    MNB_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(MNB_classifier, testingset)*100
    print("MultinomialNB Classifier accuracy = " + str(accuracy))
    return MNB_classifier

def create_bnb_classifier(trainingset, testingset):
    # Bernoulli Naive Bayes Classifier
    print("\nBernoulli Naive Bayes classifier is being trained and created...")
    BNB_classifier = SklearnClassifier(BernoulliNB())
    BNB_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(BNB_classifier, testingset)*100
    print("BernoulliNB accuracy percent = " + str(accuracy))
    return BNB_classifier

def create_logistic_regression_classifier(trainingset, testingset):
    # Logistic Regression classifier
    print("\nLogistic Regression classifier is being trained and created...")
    LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
    LogisticRegression_classifier.train(trainingset)
    print("Logistic Regression classifier accuracy = "+ str((nltk.classify.accuracy(LogisticRegression_classifier, testingset))*100))
    return LogisticRegression_classifier

def create_sgd_classifier(trainingset, testingset):
    print("\nSGD classifier is being trained and created...")
    SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
    SGDClassifier_classifier.train(trainingset)
    print("SGD Classifier classifier accuracy = " + str((nltk.classify.accuracy(SGDClassifier_classifier, testingset))*100))
    return SGDClassifier_classifier

def create_nb_classifier(trainingset, testingset):
    # Naive Bayes Classifier
    print("\nNaive Bayes classifier is being trained and created...")
    NB_classifier = nltk.NaiveBayesClassifier.train(trainingset)
    accuracy = nltk.classify.accuracy(NB_classifier, testingset)*100
    print("Naive Bayes Classifier accuracy = " + str(accuracy))
    NB_classifier.show_most_informative_features(20)
    return NB_classifier

def create_training_testing():
    """
    function that creates the feature set, training set, and testing set
    """
    with open('SMSSpamCollection') as f:
        messages = [line for line in f.read().split('\n') if line]  # skip blank lines

    print("Creating bag of words....")
    all_messages = []  # stores all the messages along with their classification
    all_words = []  # bag of words
    for message in messages:
        if message.split('\t')[0] == "spam":
            all_messages.append([message.split('\t')[1], "spam"])
        else:
            all_messages.append([message.split('\t')[1], "ham"])

        for s in string.punctuation:  # Remove punctuations
            if s in message:
                message = message.replace(s, " ")

        stop = stopwords.words('english')
        for word in message.split(" "):  # Remove stopwords
            if not word in stop:
                all_words.append(word.lower())
    print("Bag of words created.")

    random.shuffle(all_messages)
    random.shuffle(all_messages)
    random.shuffle(all_messages)

    all_words = nltk.FreqDist(all_words)
    # top 2000 words are our features (most_common sorts by frequency)
    word_features = [word for (word, count) in all_words.most_common(2000)]

    print("\nCreating feature set....")
    featureset = [(find_feature(word_features, message), category) for (message, category) in all_messages]
    print("Feature set created.")
    trainingset = featureset[:int(len(featureset)*3/4)]
    testingset = featureset[int(len(featureset)*3/4):]
    print("\nLength of feature set ", len(featureset))
    print("Length of training set ", len(trainingset))
    print("Length of testing set ", len(testingset))

    return word_features, featureset, trainingset, testingset

def main():
    """
    this function is used to show how to use this program.
    the models can be pickled if wanted or needed.
    i have used 4 mails to check if my models are working correctly.
    """
    word_features, featureset, trainingset, testingset = create_training_testing()
    NB_classifier = create_nb_classifier(trainingset, testingset)
    MNB_classifier = create_mnb_classifier(trainingset, testingset)
    BNB_classifier = create_bnb_classifier(trainingset, testingset)
    LR_classifier = create_logistic_regression_classifier(trainingset, testingset)
    SGD_classifier = create_sgd_classifier(trainingset, testingset)

    mails = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
             "Hello Ward, It has been almost 3 months since i have written you. Hope you are well.",
             "FREE FREE FREE Get a chance to win 10000 $ for free. Also get a chance to win a car and a house",
             "Hello my friend, How are you? It is has been 3 months since we talked. Hope you are well. Can we meet at my place?"]

    print("\n")
    print("Naive Bayes")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(NB_classifier.classify(feature))

    print("\n")
    print("Multinomial Naive Bayes")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(MNB_classifier.classify(feature))

    print("\n")
    print("Bernoulli Naive Bayes")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(BNB_classifier.classify(feature))

    print("\n")
    print("Logistic Regression")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(LR_classifier.classify(feature))

    print("\n")
    print("Stochastic Gradient Descent")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(SGD_classifier.classify(feature))

main()

Does your code not look like this for some reason? That is because I have added some extra lines that clear the screen to make the output look clearer and that suppress the warnings raised by Sklearn.
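The docstring of main mentions that the models can be pickled. A minimal sketch of that idea with Python's standard pickle module is shown below. The model here is just a small dictionary standing in for a trained classifier, and the file name model.pickle is an assumption. Whether a particular trained classifier pickles cleanly can depend on the library versions you have installed.

```python
import os
import pickle
import tempfile

# stand-in for a trained classifier (assumption: the real object would be one of the models)
model = {"free": "spam", "hello": "ham"}

# hypothetical file name, written to a temporary folder for this demo
path = os.path.join(tempfile.mkdtemp(), "model.pickle")

with open(path, "wb") as f:  # pickle files are opened in binary mode
    pickle.dump(model, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == model)  # True
```

Loading the stored model later saves retraining on every run. Only unpickle files that you created yourself, since loading a pickle can run arbitrary code.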

That's it for this tutorial.

Source Code

Thursday, 21 September 2017

Spam Detection using Machine Learning in Python Part 2 - Feature Extraction

I hope your computer is now fully set up. If not, see my previous post here. In this part we will learn the steps that will be followed to create our spam detection system, what features are and how they can be extracted from sentences. In the process we might also learn something about machine learning.

Steps that will be followed

  1. Get a good dataset which contains a lot of "spam" and "ham" messages.
  2. Get each and every message and then create a bag of words
  3. Extract the features from the bag of words
  4. Fill up the feature set
  5. Train and test a model
  6. Store the model for later use (optional)

Step 1:- Get a "spam" and "ham" dataset

Since in machine learning we need to teach our model which message is "spam" and which message is "ham", we need to get a dataset that has exactly that. In my case I have used the dataset provided here https://archive.ics.uci.edu/ml/machine-learning-databases/00228/. In it each message is classified as either spam or ham. Extract it to a folder and you will find a file called SMSSpamCollection. Each line of the file has this format-
<classification><tab><message>

Step 2:- Creating a bag of words

The raw dataset cannot be fed to the algorithm which will train our model. Hence, we need to create a bag of words from which we will create our feature set. But let us first get our messages and the bag of words:-

import nltk
from nltk.corpus import stopwords
import string

with open('SMSSpamCollection') as f:
    messages = [line for line in f.read().split('\n') if line]  # skip blank lines

print("Creating bag of words....")
all_messages = []  # stores all the messages along with their classification
all_words = []  # bag of words
for message in messages:
    if message.split('\t')[0] == "spam":
        all_messages.append([message.split('\t')[1], "spam"])
    else:
        all_messages.append([message.split('\t')[1], "ham"])

    for s in string.punctuation:  # Remove punctuations
        if s in message:
            message = message.replace(s, " ")

    stop = stopwords.words('english')
    for word in message.split(" "):  # Remove stopwords
        if not word in stop:
            all_words.append(word.lower())
print("Bag of words created.")

Ok. This might be a lot to take in all at once. Here's a breakdown:-
  1. Lines 1-3 are the necessary imports.
  2. Lines 5-6 read the SMSSpamCollection file and store each message in the messages list. Each message is in the format <classification><tab><message>
  3. Lines 9-10 define 2 empty lists, all_messages and all_words. The first will contain all the messages along with their classification and the second will contain all the words except English stopwords.
  4. Lines 11-15 store each message in all_messages along with the message's classification, i.e. spam or ham.
  5. Lines 17-19 replace all the punctuation in each message with spaces.
  6. Lines 21-24 take each word from each message and, if the word is not a stopword, convert it to lowercase and append it to all_words.
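The breakdown above can be followed on a single line of input. This sketch uses one made-up line in the <classification><tab><message> format and a tiny hand-written stopword set in place of stopwords.words('english'), so it runs without NLTK. The names parse_line, words_from and toy_stopwords are invented for the example.

```python
import string

# assumption: a tiny stand-in for stopwords.words('english')
toy_stopwords = {"the", "is", "a", "to", "you"}

def parse_line(line):
    # split only on the first tab: classification on the left, message on the right
    label, text = line.split("\t", 1)
    return text, label

def words_from(message):
    for s in string.punctuation:  # replace every punctuation mark with a space
        message = message.replace(s, " ")
    # drop stopwords, then lowercase what is left
    return [word.lower() for word in message.split(" ") if word not in toy_stopwords]

text, label = parse_line("spam\tHello, you won a prize!")
words = words_from(text)
print(label, words)  # spam ['hello', '', 'won', 'prize', '']
```

The empty strings show up wherever a punctuation mark sat next to a space, because split(" ") splits on every single space. Calling split() with no argument would drop them, if you prefer a cleaner bag of words.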

Step 3:- Extracting features

Now that we have the bag of words, we can extract features from it. So what are these features? Features can be considered as properties of the sentence, in this case. So here the features of a sentence are the words in it. But since every sentence has different words in it, it is useless to take every word in every sentence as it will make our feature set unnecessarily complicated. Hence, the best way to choose our features is to take the most used words, i.e. the words that occur most often.
Fortunately, NLTK provides us with just that, so we don't have to write functions for finding the most used words. Let's use it to find our features, which will be our top 2000 words.

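all_words = nltk.FreqDist(all_words)
# top 2000 words are our features (most_common sorts by frequency)
word_features = [word for (word, count) in all_words.most_common(2000)]

all_words is now a frequency distribution that maps every word in the bag of words to the number of times it occurs. In NLTK 3 its keys are not sorted by frequency, so the most frequent words are requested with most_common(), which returns (word, count) pairs in descending order of count.
word_features contains the top 2000 words.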
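Picking the most frequent words can be tried out without NLTK, because in NLTK 3 FreqDist is built on top of Python's collections.Counter and shares its most_common method. The bag of words below is made up and the top 2 words are taken instead of the top 2000.

```python
from collections import Counter

# made-up bag of words
bag = ["free", "win", "free", "hello", "win", "free"]

counts = Counter(bag)  # maps each word to how often it occurs
top_two = [word for word, count in counts.most_common(2)]

print(counts["free"], counts["win"], counts["hello"])  # 3 2 1
print(top_two)  # ['free', 'win']
```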

End of Part 2....

Now that we have extracted our features successfully, we can proceed to the next part where we fill up our feature data set and train and test a few models. Goodbye for now......

Wednesday, 20 September 2017

Spam Detection using Machine Learning in Python Part 1 - Setting up your computer

How does Spam Detection work?

Spam, according to Google, is "irrelevant or unsolicited messages sent over the Internet, typically to a large number of users, for the purposes of advertising, phishing, spreading malware, etc.". We receive these messages in our mailboxes almost daily. But they do not stop there. They keep on coming to our inbox until we either respond to them or move them to our spam box. When we do the latter, Google Mail or Yahoo Mail or whatever you use learns from it and puts similar mail in the spam box as soon as it arrives.
So how does the mail service provider know if it is a spam mail or not? The answer is machine learning. In machine learning we program a computer so that it can learn on its own: when the computer gets new data it can learn, adapt and grow from it.

A newbie's guide to Machine Learning in Python

Though machine learning is a highly advanced topic, I will still try my level best to keep this as newbie friendly as possible. So how do you set up your computer? Follow these simple steps-
  1. Install Python on your PC. Look here for complete steps https://evilporthacker.blogspot.in/2017/09/installing-python-in-your-windows.html
  2. Install a good text editor like Sublime Text 3 https://www.sublimetext.com/3
  3. Open up command prompt.
  4. Type the following commands to install the important modules in Python-
    1. pip install nltk
    2. pip install numpy
    3. pip install scipy
    4. pip install scikit-learn (this is the package that is imported as sklearn)
  5. Though a proper, real life machine learning + natural language processing project requires more modules, for this project you can install these 4 only.
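After the installs finish, a quick way to confirm that everything is in place is to ask Python whether it can find each module. The sketch below uses only the standard library. The helper name missing_modules is made up for the example. Note that the names in the list are import names, so Scikit-Learn appears as sklearn.

```python
import importlib.util

def missing_modules(names):
    # find_spec returns None when a top-level module cannot be found
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["nltk", "numpy", "scipy", "sklearn"]
print("Missing modules:", missing_modules(required))
```

An empty list means all four modules can be imported. The English stopword list used in Part 2 is a separate download, which NLTK fetches when you run nltk.download('stopwords') once in a Python session.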

Brief explanation of the work of the libraries

  • NLTK (Natural Language Toolkit) is a library that has awesome tools for human language processing. It provides easy-to-use interfaces to over 50 corpora and lexical resources. In simple words, it is a Swiss army knife for Natural Language Processing.
  • Sklearn (Scikit Learn) is a library that has simple yet efficient tools for data analysis and mining. I consider it to be the greatest library for machine learning in Python.
  • Numpy provides great tools for numerical operations like matrix manipulation, Fourier transforms, linear algebra etc.
  • Scipy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. (Could not get words to describe it. So just copied it from the website😅).

Conclusion

With your computer now fully set up, you can start this project by going to the second part, where I will be showing you how to get the training data for the models and how to extract features from it. I included all the links to the libraries just in case you want to know something more about the modules. Goodbye for now.......😙

Gesture driven Virtual Keyboard using OpenCV + Python

Hello Readers, long time no see. In this tutorial, I will be teaching you how to create a gesture driven Virtual Keyboard with OpenCV and P...