Hopefully your computer is now fully set up. If not, see my previous post here. In this part we will go through the steps we will follow to create our spam detection system, what features are, and how they can be extracted from sentences. Along the way we might also learn something about machine learning.
Steps that will be followed
- Get a good dataset which contains a lot of "spam" and "ham" messages.
- Take each and every message and create a bag of words from them.
- Extract the features from the bag of words
- Fill up the feature set
- Train and test a model
- Store the model for later use (optional).
Step 1:- Get a "spam" and "ham" dataset
Since in machine learning we need to teach our model which messages are "spam" and which are "ham", we need a dataset that contains exactly that. In my case I have used the dataset provided here: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/. In it, each message is classified as either spam or ham. Extract it to a folder and you will find a file called SMSSpamCollection. Each line of the file has this format:-
<classification><tab><message>
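To make the format concrete, here is a minimal sketch of how one such line can be split into its two parts. The sample line is made up for illustration and is not taken from the dataset.

```python
# A made-up line in the <classification><tab><message> layout (not from the dataset).
sample_line = "spam\tWin a free prize now"

# Split only on the first tab so any tab inside the message text is preserved.
label, text = sample_line.split("\t", 1)

print(label)  # spam
print(text)   # Win a free prize now
```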
Step 2:- Creating a bag of words
The raw dataset cannot be fed directly to the algorithm that will train our model. Hence, we need to create a bag of words, from which we will create our feature set. But let us first get our messages and the bag of words:-
import nltk
from nltk.corpus import stopwords
import string

with open('SMSSpamCollection', encoding='utf-8') as f:
    messages = [line for line in f.read().split('\n') if line]  # skip blank lines

print("Creating bag of words....")
all_messages = []  # stores all the messages along with their classification
all_words = []  # bag of words

# Needs the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing.
stop = set(stopwords.words('english'))

for message in messages:
    # Separate the classification from the text so the label never ends up in the bag of words.
    label, text = message.split('\t', 1)
    all_messages.append([text, "spam" if label == "spam" else "ham"])

    for s in string.punctuation:  # Replace punctuation with spaces
        text = text.replace(s, " ")

    for word in text.lower().split():  # Lowercase first, then remove stopwords
        if word not in stop:
            all_words.append(word)

print("Bag of words created.")
Ok. This might be a lot to take in all at once. Here's a breakdown:-
- The first three lines are the necessary imports.
- The with block reads the SMSSpamCollection file and stores each message in the messages list. Each message is in the format <classification><tab><message>.
- all_messages and all_words are two empty lists that will hold, respectively, all the messages along with their classification and all the words except English stopwords.
- The first part of the loop stores each message in all_messages along with the message's classification, i.e. spam or ham.
- The loop over string.punctuation replaces every punctuation character in the message with a space.
- The last loop takes each word of the message, converts it to lowercase, and appends it to all_words if it is not a stopword.
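The cleaning steps in the breakdown can also be tried on a single message. The sketch below is only an illustration: tokenize and DEMO_STOPWORDS are hypothetical names, and the tiny stopword set stands in for NLTK's English list so that the example runs without any downloads.

```python
import string

# Tiny hand-picked stopword set for illustration only; the post itself uses
# nltk.corpus.stopwords.words('english').
DEMO_STOPWORDS = {"a", "the", "is", "to", "you", "have"}


def tokenize(text, stop_words=DEMO_STOPWORDS):
    """Lowercase the text, replace punctuation with spaces, drop stopwords."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = text.lower().translate(table)
    return [word for word in cleaned.split() if word not in stop_words]


print(tokenize("Congrats! You have WON a prize, call now."))
# ['congrats', 'won', 'prize', 'call', 'now']
```

Lowercasing before the stopword check matters here: "You" would otherwise slip past a lowercase stopword list.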
Step 3:- Extracting features
Now that we have the bag of words, we can extract features from it. So what are these features? In this case, features can be thought of as properties of a sentence, and here the features of a sentence are the words in it. But since every sentence has different words in it, taking every word of every sentence as a feature would make our feature set unnecessarily large. Hence, a simple way to choose our features is to take the most frequently used words.
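To make the idea of a feature a little more concrete, here is a rough sketch of how one message could be described by the presence or absence of a few chosen words. The function name message_features and the four demo words are assumptions made for this example; how the feature set is actually filled is left to the next part.

```python
def message_features(text, word_features):
    """Map each chosen word to True/False depending on whether the message contains it."""
    words = set(text.lower().split())
    return {word: (word in words) for word in word_features}


# A hypothetical, very small list of "top" words.
demo_word_features = ["free", "call", "prize", "meeting"]

features = message_features("Call now to claim your FREE prize", demo_word_features)
print(features)
# {'free': True, 'call': True, 'prize': True, 'meeting': False}
```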
Fortunately, NLTK provides us with just that, so we don't have to write our own functions for finding the most used words. Let's use it to find our features, which will be our top 2000 words.
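all_words = nltk.FreqDist(all_words)
# most_common(2000) returns the 2000 most frequent (word, count) pairs, highest count first
word_features = [word for word, count in all_words.most_common(2000)]  # top 2000 words are our features
all_words now maps each word in the bag of words to the number of times it occurs, and most_common() returns the words in descending order of frequency.
word_features contains the top 2000 words.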
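In NLTK 3, FreqDist is built on Python's collections.Counter, so the way the top words are picked can be tried out without NLTK at all. The word list below is made up for illustration.

```python
from collections import Counter

# A made-up bag of words.
demo_words = ["free", "call", "free", "prize", "call", "free", "now"]
counts = Counter(demo_words)

# most_common(n) returns (word, count) pairs, highest count first.
top_two = [word for word, _ in counts.most_common(2)]
print(top_two)  # ['free', 'call']
```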
End of Part 2....
Now that we have extracted our features successfully, we can proceed to the next part, where we fill up our feature data set and train and test a few models. Goodbye for now......