Welcome back to Part 3 of the tutorial. In this part we will be creating our feature set and training and testing our models. If you have not read the previous part, look here. Part 2 is really important.
Let's jump straight into today's part.
Step 4:- Creating the feature data set
From the previous part we saw that the top 2000 words are our features. But the features alone won't be enough for us. Every feature needs to have some value for a particular sentence. Since our features are the top 2000 words among the bag of words, each feature can take one of 2 values:-
- True - If the feature is present in the sentence.
- False - If the feature is not present in the sentence.
def find_feature(word_features, message):
    # Build the feature dictionary of a message: each feature word maps to
    # True if it occurs anywhere in the lowercased message, else False
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature
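To see what this function returns, here is a quick sanity check with a tiny hypothetical feature list (the real one will hold our top 2000 words):

sample_features = ["free", "win", "meeting"]  # hypothetical mini feature list
print(find_feature(sample_features, "WIN a FREE cruise now!"))
# {'free': True, 'win': True, 'meeting': False}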
Let us call this function repeatedly, once for every message in the all_messages variable, to create our feature set.
random.shuffle(all_messages)  # shuffle a few times to remove any ordering bias
random.shuffle(all_messages)
random.shuffle(all_messages)

print("\nCreating feature set....")
featureset = [(find_feature(word_features, message), category) for (message, category) in all_messages]
print("Feature set created.")

trainingset = featureset[:int(len(featureset)*3/4)]  # first 3/4th for training
testingset = featureset[int(len(featureset)*3/4):]   # remaining 1/4th for testing

print("\nLength of feature set ", len(featureset))
print("Length of training set ", len(trainingset))
print("Length of testing set ", len(testingset))
What I did here is I took all_messages and gave it a good shuffle to remove any bias. Then I called the find_feature function for every message in all_messages. Then I split the featureset variable into 2 parts: the first 3/4th is used to train our models and the remaining 1/4th is used to test them. So what does our feature set look like? Something like this:
Here S1, S2, ..., Sn are the messages or sentences, A1, A2, A3, ..., A2000 are the features, and Result is the classification of the sentence or message S as stored in all_messages.
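Concretely, every entry in featureset is a pair of (feature dictionary, label). A minimal sketch of what one row looks like (the feature words shown here are hypothetical; the real keys are our top 2000 words):

# One entry of featureset for a message S: a dict with one True/False
# value per feature word (A1 ... A2000), paired with the Result label
sample_entry = ({'free': True, 'call': False, 'win': True}, 'spam')
features, label = sample_entry
print(label)  # spam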
Step 5:- Training and Testing
Now that we have our training and testing sets, we can train our models. But presently our program looks something like this-
import nltk
import random
from nltk.corpus import stopwords
import string

def find_feature(word_features, message):
    # find features of a message
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature

with open('SMSSpamCollection') as f:
    messages = f.read().split('\n')

print("Creating bag of words....")
all_messages = []  # stores all the messages along with their classification
all_words = []     # bag of words
for message in messages:
    if not message:  # skip blank lines (e.g. the trailing newline)
        continue
    if message.split('\t')[0] == "spam":
        all_messages.append([message.split('\t')[1], "spam"])
    else:
        all_messages.append([message.split('\t')[1], "ham"])
    for s in string.punctuation:  # remove punctuation
        if s in message:
            message = message.replace(s, " ")
    stop = stopwords.words('english')
    for word in message.split(" "):  # remove stopwords
        if not word in stop:
            all_words.append(word.lower())
print("Bag of words created.")

random.shuffle(all_messages)
random.shuffle(all_messages)
random.shuffle(all_messages)

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:2000]  # top 2000 words are our features

print("\nCreating feature set....")
featureset = [(find_feature(word_features, message), category) for (message, category) in all_messages]
print("Feature set created.")

trainingset = featureset[:int(len(featureset)*3/4)]
testingset = featureset[int(len(featureset)*3/4):]

print("\nLength of feature set ", len(featureset))
print("Length of training set ", len(trainingset))
print("Length of testing set ", len(testingset))
Now that looks dirty. Damn ugly. Let us put it into a function for better readability.
import nltk
import random
from nltk.corpus import stopwords
import string

def find_feature(word_features, message):
    # find features of a message
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature

def create_training_testing():
    with open('SMSSpamCollection') as f:
        messages = f.read().split('\n')
    print("Creating bag of words....")
    all_messages = []  # stores all the messages along with their classification
    all_words = []     # bag of words
    for message in messages:
        if not message:  # skip blank lines (e.g. the trailing newline)
            continue
        if message.split('\t')[0] == "spam":
            all_messages.append([message.split('\t')[1], "spam"])
        else:
            all_messages.append([message.split('\t')[1], "ham"])
        for s in string.punctuation:  # remove punctuation
            if s in message:
                message = message.replace(s, " ")
        stop = stopwords.words('english')
        for word in message.split(" "):  # remove stopwords
            if not word in stop:
                all_words.append(word.lower())
    print("Bag of words created.")
    random.shuffle(all_messages)
    random.shuffle(all_messages)
    random.shuffle(all_messages)
    all_words = nltk.FreqDist(all_words)
    word_features = list(all_words.keys())[:2000]  # top 2000 words are our features
    print("\nCreating feature set....")
    featureset = [(find_feature(word_features, message), category) for (message, category) in all_messages]
    print("Feature set created.")
    trainingset = featureset[:int(len(featureset)*3/4)]
    testingset = featureset[int(len(featureset)*3/4):]
    print("\nLength of feature set ", len(featureset))
    print("Length of training set ", len(trainingset))
    print("Length of testing set ", len(testingset))
    return word_features, featureset, trainingset, testingset
With that out of the way we can now create our models. In this project we will be using 5 different algorithms to train 5 different models. The algorithms are:-
- Naive Bayes
- Multinomial Naive Bayes
- Bernoulli Naive Bayes
- Stochastic Gradient Descent
- Logistic Regression
Oh no!!! Algorithms.... Maths..... I think I am done with this tutorial......
Do not worry about the algorithms. You do not have to write them from scratch on your own. Scikit-Learn provides us with a large number of algorithms for data science and data mining. So it is not necessary for you to know the algorithms inside out, and using them is very easy. But having some knowledge of them is definitely helpful.
Enough said... Let us train the five models using our algorithms and check their accuracy against the testing set.
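One note before the code: the functions below use NLTK's SklearnClassifier wrapper, which lets a scikit-learn estimator be trained on NLTK-style feature sets. They assume the following imports, which you will also see in the full program at the end of this part:

import nltk
from nltk.classify.scikitlearn import SklearnClassifier  # wraps sklearn estimators for NLTK
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier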
def create_mnb_classifier(trainingset, testingset):
    # Multinomial Naive Bayes Classifier
    print("\nMultinomial Naive Bayes classifier is being trained and created...")
    MNB_classifier = SklearnClassifier(MultinomialNB())
    MNB_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(MNB_classifier, testingset)*100
    print("MultinomialNB Classifier accuracy = " + str(accuracy))
    return MNB_classifier

def create_bnb_classifier(trainingset, testingset):
    # Bernoulli Naive Bayes Classifier
    print("\nBernoulli Naive Bayes classifier is being trained and created...")
    BNB_classifier = SklearnClassifier(BernoulliNB())
    BNB_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(BNB_classifier, testingset)*100
    print("BernoulliNB Classifier accuracy = " + str(accuracy))
    return BNB_classifier

def create_logistic_regression_classifier(trainingset, testingset):
    # Logistic Regression Classifier
    print("\nLogistic Regression classifier is being trained and created...")
    LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
    LogisticRegression_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(LogisticRegression_classifier, testingset)*100
    print("Logistic Regression Classifier accuracy = " + str(accuracy))
    return LogisticRegression_classifier

def create_sgd_classifier(trainingset, testingset):
    # Stochastic Gradient Descent Classifier
    print("\nSGD classifier is being trained and created...")
    SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
    SGDClassifier_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(SGDClassifier_classifier, testingset)*100
    print("SGD Classifier accuracy = " + str(accuracy))
    return SGDClassifier_classifier

def create_nb_classifier(trainingset, testingset):
    # Naive Bayes Classifier (NLTK's own implementation)
    print("\nNaive Bayes classifier is being trained and created...")
    NB_classifier = nltk.NaiveBayesClassifier.train(trainingset)
    accuracy = nltk.classify.accuracy(NB_classifier, testingset)*100
    print("Naive Bayes Classifier accuracy = " + str(accuracy))
    NB_classifier.show_most_informative_features(20)
    return NB_classifier
See, I told you it is that easy.
Now let us create and call a main function that integrates and calls the above modules systematically. To do that-
def main():
    """
    This function shows how to use this program.
    The models can be pickled if wanted or needed.
    I have used 4 mails to check if my models are working correctly.
    """
    word_features, featureset, trainingset, testingset = create_training_testing()
    NB_classifier = create_nb_classifier(trainingset, testingset)
    MNB_classifier = create_mnb_classifier(trainingset, testingset)
    BNB_classifier = create_bnb_classifier(trainingset, testingset)
    LR_classifier = create_logistic_regression_classifier(trainingset, testingset)
    SGD_classifier = create_sgd_classifier(trainingset, testingset)
    mails = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
             "Hello Ward, It has been almost 3 months since i have written you. Hope you are well.",
             "FREE FREE FREE Get a chance to win 10000 $ for free. Also get a chance to win a car and a house",
             "Hello my friend, How are you? It is has been 3 months since we talked. Hope you are well. Can we meet at my place?"]
    print("\n")
    print("Naive Bayes")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(NB_classifier.classify(feature))
    print("\n")
    print("Multinomial Naive Bayes")
    print("-----------------------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(MNB_classifier.classify(feature))
    print("\n")
    print("Bernoulli Naive Bayes")
    print("---------------------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(BNB_classifier.classify(feature))
    print("\n")
    print("Logistic Regression")
    print("-------------------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(LR_classifier.classify(feature))
    print("\n")
    print("Stochastic Gradient Descent")
    print("---------------------------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(SGD_classifier.classify(feature))

main()
What I am doing here is taking 4 mails/messages and checking which of them are spam and which of them are ham. To do that-
- Take each mail at a time.
- Find the feature set for it.
- Use the feature set with the different classifiers to see what the message is, i.e. spam or ham.
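The docstring of main also mentions that the models can be pickled if wanted or needed. Here is a minimal sketch with Python's standard pickle module, assuming a trained NB_classifier; the file name spam_nb.pickle is just an example:

import pickle

# Save a trained classifier to disk (hypothetical file name)
with open("spam_nb.pickle", "wb") as f:
    pickle.dump(NB_classifier, f)

# Load it back later and classify without retraining
with open("spam_nb.pickle", "rb") as f:
    loaded_classifier = pickle.load(f)
print(loaded_classifier.classify(find_feature(word_features, "WIN a FREE prize now")))

Putting everything together, the complete program looks like this: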
import nltk
import random
import os
from nltk.corpus import stopwords
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
import string
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

# For clearing the screen
if os.name == 'nt':
    clear_screen = "cls"
else:
    clear_screen = "clear"
os.system(clear_screen)

def find_feature(word_features, message):
    # find features of a message
    feature = {}
    for word in word_features:
        feature[word] = word in message.lower()
    return feature

def create_mnb_classifier(trainingset, testingset):
    # Multinomial Naive Bayes Classifier
    print("\nMultinomial Naive Bayes classifier is being trained and created...")
    MNB_classifier = SklearnClassifier(MultinomialNB())
    MNB_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(MNB_classifier, testingset)*100
    print("MultinomialNB Classifier accuracy = " + str(accuracy))
    return MNB_classifier

def create_bnb_classifier(trainingset, testingset):
    # Bernoulli Naive Bayes Classifier
    print("\nBernoulli Naive Bayes classifier is being trained and created...")
    BNB_classifier = SklearnClassifier(BernoulliNB())
    BNB_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(BNB_classifier, testingset)*100
    print("BernoulliNB Classifier accuracy = " + str(accuracy))
    return BNB_classifier

def create_logistic_regression_classifier(trainingset, testingset):
    # Logistic Regression Classifier
    print("\nLogistic Regression classifier is being trained and created...")
    LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
    LogisticRegression_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(LogisticRegression_classifier, testingset)*100
    print("Logistic Regression Classifier accuracy = " + str(accuracy))
    return LogisticRegression_classifier

def create_sgd_classifier(trainingset, testingset):
    # Stochastic Gradient Descent Classifier
    print("\nSGD classifier is being trained and created...")
    SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
    SGDClassifier_classifier.train(trainingset)
    accuracy = nltk.classify.accuracy(SGDClassifier_classifier, testingset)*100
    print("SGD Classifier accuracy = " + str(accuracy))
    return SGDClassifier_classifier

def create_nb_classifier(trainingset, testingset):
    # Naive Bayes Classifier (NLTK's own implementation)
    print("\nNaive Bayes classifier is being trained and created...")
    NB_classifier = nltk.NaiveBayesClassifier.train(trainingset)
    accuracy = nltk.classify.accuracy(NB_classifier, testingset)*100
    print("Naive Bayes Classifier accuracy = " + str(accuracy))
    NB_classifier.show_most_informative_features(20)
    return NB_classifier

def create_training_testing():
    """
    Creates the feature set, training set, and testing set.
    """
    with open('SMSSpamCollection') as f:
        messages = f.read().split('\n')
    print("Creating bag of words....")
    all_messages = []  # stores all the messages along with their classification
    all_words = []     # bag of words
    for message in messages:
        if not message:  # skip blank lines (e.g. the trailing newline)
            continue
        if message.split('\t')[0] == "spam":
            all_messages.append([message.split('\t')[1], "spam"])
        else:
            all_messages.append([message.split('\t')[1], "ham"])
        for s in string.punctuation:  # remove punctuation
            if s in message:
                message = message.replace(s, " ")
        stop = stopwords.words('english')
        for word in message.split(" "):  # remove stopwords
            if not word in stop:
                all_words.append(word.lower())
    print("Bag of words created.")
    random.shuffle(all_messages)
    random.shuffle(all_messages)
    random.shuffle(all_messages)
    all_words = nltk.FreqDist(all_words)
    word_features = list(all_words.keys())[:2000]  # top 2000 words are our features
    print("\nCreating feature set....")
    featureset = [(find_feature(word_features, message), category) for (message, category) in all_messages]
    print("Feature set created.")
    trainingset = featureset[:int(len(featureset)*3/4)]
    testingset = featureset[int(len(featureset)*3/4):]
    print("\nLength of feature set ", len(featureset))
    print("Length of training set ", len(trainingset))
    print("Length of testing set ", len(testingset))
    return word_features, featureset, trainingset, testingset

def main():
    """
    This function shows how to use this program.
    The models can be pickled if wanted or needed.
    I have used 4 mails to check if my models are working correctly.
    """
    word_features, featureset, trainingset, testingset = create_training_testing()
    NB_classifier = create_nb_classifier(trainingset, testingset)
    MNB_classifier = create_mnb_classifier(trainingset, testingset)
    BNB_classifier = create_bnb_classifier(trainingset, testingset)
    LR_classifier = create_logistic_regression_classifier(trainingset, testingset)
    SGD_classifier = create_sgd_classifier(trainingset, testingset)
    mails = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
             "Hello Ward, It has been almost 3 months since i have written you. Hope you are well.",
             "FREE FREE FREE Get a chance to win 10000 $ for free. Also get a chance to win a car and a house",
             "Hello my friend, How are you? It is has been 3 months since we talked. Hope you are well. Can we meet at my place?"]
    print("\n")
    print("Naive Bayes")
    print("-----------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(NB_classifier.classify(feature))
    print("\n")
    print("Multinomial Naive Bayes")
    print("-----------------------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(MNB_classifier.classify(feature))
    print("\n")
    print("Bernoulli Naive Bayes")
    print("---------------------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(BNB_classifier.classify(feature))
    print("\n")
    print("Logistic Regression")
    print("-------------------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(LR_classifier.classify(feature))
    print("\n")
    print("Stochastic Gradient Descent")
    print("---------------------------")
    for mail in mails:
        feature = find_feature(word_features, mail)
        print(SGD_classifier.classify(feature))

main()
Does your code for some reason not look like this? That is because I have added some extra lines to make the output look cleaner and to suppress the warnings from Sklearn. That's it for this tutorial.