Today there are millions of phones worldwide, with millions of messages being sent and received every second. Thanks to the internet, it is now possible to send messages instantly and for free. Companies are starting to use these services to connect with new consumers. Unfortunately, disreputable companies have started to spam potential consumers using SMS messaging. This type of spam can end up costing consumers a lot, because many phone companies charge a fee per SMS received.

This post explores the use of predictive modeling, using the naive Bayes algorithm, to create a spam filter that can correctly distinguish spam from legitimate messages and save consumers money.


The data comes from the SMS Spam Collection data set.

This data set has two columns. The first identifies whether the message is ham (a legitimate message) or spam; the second contains the message text. This is a sample of what the messages look like.

ham, "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine 
      there got amore wat..."
ham,  Ok lar... Joking wif u oni...
spam, Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry 
      question(std txt rate) 

By comparing the ham messages with the spam messages, we see there are certain characteristics that can be used to identify whether a message is spam or ham. In the sample messages above, the spam message contains words such as "free" and "win" while the ham messages do not. So an easy way to classify a message would be to check whether the words "free" and "win" show up and, if so, classify that message as spam. The naive Bayes algorithm works in a similar manner.

The data set is available from the website in a text format, but I transformed it into a CSV file so it would be easier to read into R.

spam_text <- read.csv("spam_text.csv", header = FALSE, stringsAsFactors = FALSE)
str(spam_text) # examine the structure of the data

Using the str() function, we see that the data has 5572 messages with 2 features, which R labels V1 and V2.

'data.frame': 5572 obs. of 2 variables:
 $ V1: chr "ham" "ham" "spam" "ham" ...
 $ V2: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet ...

To make the data easier to understand, the columns are given more descriptive names using the names() function: V1 is given the column name type and V2 is given the column name message.


Looking at the str() output, the type variable is currently coded as a character variable. Since it is a categorical variable, it first needs to be transformed into a factor in order to perform statistical analysis on the data. After transforming the type variable into a factor, the table() function shows that there are 4825 real messages and 747 spam messages in the data set.


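The renaming and factor conversion described above might look like the following sketch (the original code was not shown in the post):

```r
names(spam_text) <- c("type", "message")  # rename V1 and V2
spam_text$type <- factor(spam_text$type)  # character -> factor
table(spam_text$type)                     # count ham vs spam messages
```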

ham spam 
4825 747

The next step is to clean and organize the data. There are elements within the data set, such as numbers, punctuation, and white space, that would produce spurious results if left untouched. For a data set this large, a community package called tm provides functions that make it easy to clean the data and remove the unnecessary elements. The first step in cleaning the data is to transform it into a corpus, which is a collection of text documents, using the tm library.

library(tm) #load the library
text_corpus<-Corpus(VectorSource(spam_text$message)) # transforms the message column into a corpus 
print(text_corpus) # print() tells us that there are 5572 messages within the data set
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 5572

The tm_map() function provides a way to transform corpus objects. The code below converts all letters to lower case and removes numbers, punctuation, extra white space, and commonly used stop words such as "and", "if", and "such".

clean_corpus <- tm_map(text_corpus, tolower)                   # convert all letters to lower case so "Hello" is the same as "HELLO"
clean_corpus <- tm_map(clean_corpus, removeNumbers)            # remove numbers
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords()) # remove stop words
clean_corpus <- tm_map(clean_corpus, removePunctuation)        # remove punctuation
clean_corpus <- tm_map(clean_corpus, stripWhitespace)          # collapse extra white space

The next step is to create a sparse matrix in which the rows are documents and the columns are terms. In this data set the documents are the messages and the columns are the words, so each cell holds the number of times that word appears in the corresponding message. The DocumentTermMatrix() function transforms the corpus into such a sparse matrix. But before DocumentTermMatrix() can be used, the corpus must be transformed back into plain text documents, because functions such as tolower return a character corpus that DocumentTermMatrix() does not know how to handle.

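The conversion and matrix construction might look like this sketch (variable names assumed):

```r
clean_corpus <- tm_map(clean_corpus, PlainTextDocument) # coerce back to plain text documents
corpus_mat <- DocumentTermMatrix(clean_corpus)          # rows = messages, columns = words
```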

Now that the data is clean and properly formatted, it is time to split it into training and testing sets. I put about 80 percent of the data in the training set and 20 percent in the testing set.


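A sketch of the split, assuming the first 4460 rows (about 80 percent) go to training; the raw data frame is split alongside the document-term matrix so the labels stay aligned with the rows (names hypothetical):

```r
spam_raw_train <- spam_text[1:4460, ]     # raw labels and messages, training portion
spam_raw_test  <- spam_text[4461:5572, ]  # raw labels and messages, testing portion
corpus_mat_train <- corpus_mat[1:4460, ]
corpus_mat_test  <- corpus_mat[4461:5572, ]
table(spam_raw_test$type)   # test set counts
table(spam_raw_train$type)  # training set counts
```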

 ham spam 
 967 145 

 ham spam 
3858 602 

The table() function shows the number of real messages and spam messages in each data set. Looking at the numbers, there is a similar ratio of ham to spam messages in each set. Naive Bayes works by looking at the words that appear within these messages and analyzing their frequencies in order to classify the messages. A simple word cloud provides an intuitive look at the type of words that appear within spam messages versus ham messages.

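The clouds shown below could be produced with the wordcloud package, roughly as follows (a sketch; the subset names and display parameters are assumptions):

```r
library(wordcloud)
spam_msgs <- subset(spam_raw_train, type == "spam")
ham_msgs  <- subset(spam_raw_train, type == "ham")
wordcloud(spam_msgs$message, max.words = 40, scale = c(3, 0.5)) # spam word cloud
wordcloud(ham_msgs$message, max.words = 40, scale = c(3, 0.5))  # real messages cloud
```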





Spam Word Cloud


Real Messages Cloud

Looking at the word clouds, spam messages tend to contain words such as free, now, reply, and prize, while the real messages cloud does not. These differences allow naive Bayes to classify each message accurately.

The final step before building the naive Bayes classifier is to transform the sparse matrix into something that can be used to train it. Currently the sparse matrix contains over 8000 features, many of which are useless for classification. The code below creates a list of words that appear at least 5 times and uses the dictionary parameter to create a new document-term matrix containing only those words.

<<DocumentTermMatrix (documents: 4460, terms: 8290)>>
Non-/sparse entries: 35507/36937893
Sparsity : 100%
Maximal term length: 40
Weighting : term frequency (tf)

spam_dict <- findFreqTerms(corpus_mat_train, 5) # words appearing at least 5 times

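Applying the dictionary might look like this, assuming the cleaned corpus was also split into training and testing portions (the corpus names here are hypothetical):

```r
spam_train <- DocumentTermMatrix(clean_corpus_train, list(dictionary = spam_dict))
spam_test  <- DocumentTermMatrix(clean_corpus_test,  list(dictionary = spam_dict))
```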

<<DocumentTermMatrix (documents: 4460, terms: 1298)>>
Non-/sparse entries: 26623/5762457
Sparsity : 100%
Maximal term length: 19
Weighting : term frequency (tf)

After removing words that appear fewer than 5 times, the number of features was reduced from 8290 to 1298. The next step is to transform these counts into categorical features, since this naive Bayes implementation is trained on categorical features, using the custom function defined below. It places a "Yes" where the count of the word is greater than 0 and a "No" otherwise. The function is then applied to the training and testing data using apply().

convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
spam_train <- apply(spam_train, MARGIN = 2, convert_counts)
spam_test <- apply(spam_test, MARGIN = 2, convert_counts)

MARGIN = 2 specifies that the function should be applied to the columns.


Training the Model

It is now time to train the classifier. To do so, the e1071 library needs to be installed and loaded into R. The e1071 library implements many machine learning algorithms, including naive Bayes. The following R code trains the Bayes classifier.


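A sketch of the training call (the classifier variable name is an assumption):

```r
library(e1071)
spam_classifier <- naiveBayes(spam_train, spam_raw_train$type) # learn word probabilities per class
```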

Evaluating Model Performance

Now that the classifier has been created, we can test the performance of the model on the test data set using the following R code.

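The prediction step presumably looks like:

```r
spam_predict <- predict(spam_classifier, spam_test) # predicted labels for the test messages
```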

library(gmodels) # provides CrossTable()
CrossTable(spam_predict, spam_raw_test$type)


   Cell Contents
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |

Total Observations in Table:  1112 

             | spam_raw_test$type 
spam_predict |       ham |      spam | Row Total | 
         ham |       963 |        17 |       980 | 
             |    14.402 |    96.049 |           | 
             |     0.983 |     0.017 |     0.881 | 
             |     0.996 |     0.117 |           | 
             |     0.866 |     0.015 |           | 
        spam |         4 |       128 |       132 | 
             |   106.927 |   713.094 |           | 
             |     0.030 |     0.970 |     0.119 | 
             |     0.004 |     0.883 |           | 
             |     0.004 |     0.115 |           | 
Column Total |       967 |       145 |      1112 | 
             |     0.870 |     0.130 |           | 

From the confusion table we can see that this model has excellent accuracy on the test data set: the classifier correctly classified the messages with 98 percent accuracy ((963 + 128) / 1112). This goes to show that even a simple algorithm such as naive Bayes can produce powerful results.


The purpose of this data analysis was to create a classifier to correctly identify whether a message is spam or not. We were able to do so with a high level of accuracy: by looking at the words present within a message, the classifier was able to classify it correctly. Using a model such as this, companies can block spammers on their SMS networks and attract more consumers by advertising their spam filtering capabilities.
This model could be improved by looking at the messages that were misclassified and understanding why they were misclassified, using the following command.


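The indices of the misclassified messages shown below can be found with a comparison like:

```r
which(spam_predict != spam_raw_test$type) # indices where prediction and truth disagree
```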
[1] 55 68 82 217 270 295 314 339 362 455 471 509 578 587 911
[16] 922 968 990 997 1007 1081
> spam_raw_test$type[c(55,217)] 
[1] spam spam
Levels: ham spam

> spam_predict[c(55,217)] # messages with indices 55 and 217 were misclassified as ham when they were actually spam
[1] ham ham
Levels: ham spam

> spam_raw_test$message[c(55,217)]
[1] "Money i have won wining number 946 wot do i do next" 
[2] "Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? 
U been missing me? SP Text stop to stop 150p/text"