Wednesday, 20 September 2017

Spam Detection using Machine Learning in Python Part 1 - Setting up your computer

How does spam detection work?

Spam, according to Google, is "irrelevant or unsolicited messages sent over the Internet, typically to a large number of users, for the purposes of advertising, phishing, spreading malware, etc." We receive these messages in our mailboxes almost daily, and they do not stop on their own. They keep arriving in our inbox until we either respond to them or move them to the spam folder. When we do the latter, Google Mail, Yahoo Mail, or whatever service you use learns from it and sends similar mail straight to the spam folder the next time it arrives.
So how does the mail service provider know whether a mail is spam or not? The answer is machine learning. In machine learning, we program a computer so that it can learn on its own: when it gets new data, it learns from that data and adapts.
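
To make the idea of "learning from new data" a bit more concrete, here is a minimal sketch of a toy spam filter that uses only the Python standard library. It simply counts words in messages that have been labelled as spam or not spam ("ham") and scores new messages against those counts, in the style of a naive Bayes classifier. This is only an illustration and not how Google Mail or Yahoo Mail actually work; the class name TinySpamFilter and its methods are made up for this example.

```python
from collections import Counter
import math


class TinySpamFilter:
    """Toy word-count classifier (naive Bayes style). Hypothetical, for illustration only."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        # "Learning" here just means updating the counts with one labelled message.
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def predict(self, text):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Log prior, smoothed so a label with no messages never gives log(0).
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero out the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary) + 1))
            scores[label] = score
        # Return the label with the higher score.
        return max(scores, key=scores.get)
```

Every call to learn() changes the counts, so the filter's answers shift as it sees more labelled mail. That is the same basic loop as marking a mail as spam in your inbox, just on a much smaller scale.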

A newbie's guide to Machine Learning in Python

Though machine learning is an advanced topic, I will try my best to keep this as newbie-friendly as possible. So how do you set up your computer? Follow these simple steps:
  1. Install Python on your PC. Look here for complete steps.
  2. Install a good text editor like Sublime Text 3.
  3. Open up the command prompt.
  4. Type the following commands to install the important Python modules:
    1. pip install nltk
    2. pip install numpy
    3. pip install scipy
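    4. pip install scikit-learn
  5. A proper, real-life machine learning + natural language processing project requires more modules, but for this project these 4 are enough. Note that scikit-learn is the package name for pip, while the module you import in Python is called sklearn.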
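
Once the installs finish, a quick way to check that Python can actually find the modules is a small script like the one below. This is just a sketch using the standard library's importlib; the function name find_missing is my own choice for this example. It checks the import names, which is why scikit-learn appears as sklearn.

```python
import importlib.util

# Import names, not pip package names: scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["nltk", "numpy", "scipy", "sklearn"]


def find_missing(modules):
    """Return the module names from the list that Python cannot find."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All modules are installed.")
```

Save it to a file, run it with Python from the command prompt, and rerun the matching pip command for anything it reports as missing.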

A brief explanation of what the libraries do

  • NLTK (Natural Language Toolkit) is a library that has awesome tools for human language processing. It provides easy-to-use interfaces to over 50 corpora and lexical resources. In simple words, it is a Swiss army knife for natural language processing.
  • Sklearn (scikit-learn) is a library that has simple yet efficient tools for data analysis and data mining. I consider it to be the greatest library for machine learning in Python.
  • NumPy provides great tools for numerical operations like matrix manipulation, Fourier transforms, linear algebra, etc.
  • SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. (I could not find the words to describe it, so I just copied this from the website 😅.)


With your computer now fully set up, you can start this project by going to the second part, where I will show you how to get the training data for the models we will be training. I included all the links to the libraries just in case you want to know more about the modules. Goodbye for now 😙



