I recently started competing in the Kaggle Homesite Quote Conversion contest.
This contest is being held by Homesite, a provider of homeowner insurance that wants a model to predict whether a quoted price will lead to a purchase. A better understanding of the impact of proposed pricing changes will allow Homesite to maintain an ideal portfolio of customer segments.
The data provided is an anonymized database of information on customer and sales activity, including property and coverage information. Each row in the data consists of a quote number and numerous features, including the QuoteConversion_Flag, which indicates whether the customer purchased the quote or not. In the data, the conversion rate was 18.73 percent, meaning that share of customers purchased the quote.
There were some missing values in the data, and those were replaced with -1.
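The replacement step can be sketched in a few lines of plain Python. This is only an illustration, assuming rows are held as dictionaries and that missing entries show up as `None` or empty strings; the helper name `fill_missing`, the column names `feature_a` and `feature_b`, and the sample rows are all hypothetical.

```python
MISSING_SENTINEL = -1  # the placeholder value used for missing data


def fill_missing(rows, sentinel=MISSING_SENTINEL):
    """Return copies of the rows with None or empty-string values replaced."""
    filled = []
    for row in rows:
        filled.append({
            key: sentinel if value is None or value == "" else value
            for key, value in row.items()
        })
    return filled


# Hypothetical rows; feature_a and feature_b are made-up column names.
sample_rows = [
    {"QuoteNumber": 1, "feature_a": 2, "feature_b": None},
    {"QuoteNumber": 2, "feature_a": None, "feature_b": 0},
]
clean_rows = fill_missing(sample_rows)
print(clean_rows)
```

Note that a sentinel like -1 only works cleanly when -1 is not already a legitimate value of the column, which is worth checking before applying it.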
There are 299 variables in the data including:
- QuoteNumber : int 1 2 4 6 8 12 13 14 18 19 ..
- Original_Quote_Date : Date, format: "2013-08-16" "2014-04-22" "2014-08-25" "2013-04-15" ..
- QuoteConversion_Flag: int 0 0 0 0 0 0 0 0 0 0 ..
- Field6 : chr "B" "F" "F" "J" ...
Two of those variables, PropertyField6 and GeographicField10A, were constant and removed from the data.
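One way to find such constant columns is to count the distinct values in each column and drop any column that has only one. The sketch below assumes rows stored as dictionaries; the helper names and the sample values are hypothetical, and only the two column names come from the data described above.

```python
def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    distinct = {}
    for row in rows:
        for key, value in row.items():
            distinct.setdefault(key, set()).add(value)
    return sorted(name for name, values in distinct.items() if len(values) == 1)


def drop_columns(rows, names):
    """Return copies of the rows without the named columns."""
    to_drop = set(names)
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# Made-up values for illustration only.
sample_rows = [
    {"PropertyField6": 0, "GeographicField10A": -1, "Field6": "B"},
    {"PropertyField6": 0, "GeographicField10A": -1, "Field6": "F"},
    {"PropertyField6": 0, "GeographicField10A": -1, "Field6": "J"},
]
constant = constant_columns(sample_rows)
reduced_rows = drop_columns(sample_rows, constant)
print(constant)
print(reduced_rows)
```

A constant column carries no information for splitting, so removing it does not change what a tree model can learn.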
XGBoost is short for “Extreme Gradient Boosting” and is an algorithm that is quite popular in Kaggle competitions. It works by building an ensemble of trees and summing the prediction scores of the individual trees to get the final score. XGBoost is similar to randomForest in that both are tree ensembles, but it differs in that XGBoost adds each new tree to complement the ones already built.
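To make the "each new tree complements the previous ones" idea concrete, here is a heavily simplified boosting sketch in plain Python. It is not XGBoost itself: it assumes a single numeric feature, one-split trees (stumps), squared error, and no regularization, and every function name and the toy data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def stump_predict(stump, x):
    """Return the stump's output for a single value."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    trees = []
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return trees


def ensemble_predict(trees, x, learning_rate=0.5):
    """The final score is the scaled sum of every tree's output."""
    return sum(learning_rate * stump_predict(t, x) for t in trees)


# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
trees = boost(xs, ys)
scores = [ensemble_predict(trees, x) for x in xs]
print([round(s, 3) for s in scores])
```

The contrast with a random forest is in `boost`: each stump is trained on what the earlier stumps still get wrong, whereas a random forest trains its trees independently and averages them.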
Using this algorithm I was able to get a high AUC value of 0.96424.
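For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen converted quote receives a higher score than a randomly chosen unconverted one. The sketch below computes it directly from that definition by comparing every positive/negative pair; this is fine for a toy example but far too slow for the full dataset. The function name `auc` and the labels and scores are made up.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up flags (1 = purchased) and model scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
example_auc = auc(labels, scores)
print(round(example_auc, 4))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a score of exactly 1.000 on real data is usually a sign that something has leaked.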
The number 1 score on Kaggle is 0.96820, so there is very little marginal gain left to be made. I haven't had a lot of time to focus on this contest, but there are still 2 months left to try and get a high ranking. Going forward I will try ensembling, and possibly feature engineering, to see if any marginal gains can be made.
Lessons learned so far:
- Be careful of data leakage. I forgot to remove the QuoteNumber variable on my first run and got an unrealistic AUC of 1.000.
- Use the same factor levels for the train and test data. Before I started using the same factor levels for both, I was getting an AUC of 0.68. After correcting this mistake I was getting an AUC of 0.96! An incredible increase in performance.
- More memory is better. I tried to use the ExtraTrees algorithm to see if I could get better performance, but could not get it to run due to memory limitations. In my next attempt I am going to use a sample of the training data to try and get it to work.
- XGBoost seems sensitive to the initial seed: it gives different performance depending on the seed, possibly because it finds different local optima. I will experiment with this more in the future.
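The factor-level lesson is the easiest one to show with a small example. The sketch below is plain Python rather than the code I actually ran, and the helper names and the Field6 values are hypothetical. It encodes a categorical column twice: once with separate level mappings for train and test, and once with a single mapping built from both.

```python
def build_levels(*columns):
    """Map each distinct category to an integer code, shared by all columns."""
    values = sorted(set().union(*columns))
    return {value: code for code, value in enumerate(values)}


def encode(column, levels):
    """Encode a column with a fixed level mapping."""
    return [levels[value] for value in column]


# Hypothetical Field6 values for a train and a test split.
train_field6 = ["B", "F", "F", "J"]
test_field6 = ["F", "J", "K"]

# Separate mappings: the same letter can receive different codes in each split.
separate_train = encode(train_field6, build_levels(train_field6))
separate_test = encode(test_field6, build_levels(test_field6))

# Shared mapping: one set of levels built from both splits.
shared_levels = build_levels(train_field6, test_field6)
shared_train = encode(train_field6, shared_levels)
shared_test = encode(test_field6, shared_levels)

print(separate_train, separate_test)
print(shared_train, shared_test)
```

With separate mappings, "F" is encoded one way in the training data and another way in the test data, so the model ends up scoring test rows as if they belonged to a different category. Building the levels once from both splits avoids that mismatch.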