Black Friday Sales Prediction

This is a hackathon hosted on AnalyticsVidhya.com that I participated in. A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase amount from last month.

Github Link to Project

Data Preparation:

Many of the features in the data are categorical, so we need to convert them to dummy indicator variables before modelling. For example, there is a data element called Age which has bins like 0-17, 18-25, 26-35 etc. Luckily, pandas has a get_dummies function which easily converts these variables into dummy indicator variables.
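As a minimal sketch of this step, the snippet below runs get_dummies on a tiny made-up sample. The column names (Age, Gender, Purchase) and the values are assumptions for illustration, not the actual hackathon data.

```python
import pandas as pd

# Tiny made-up sample; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "Age": ["0-17", "18-25", "26-35", "18-25"],
    "Gender": ["F", "M", "M", "F"],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Create one indicator column per category level.
# Columns not listed in `columns` (here Purchase) pass through unchanged.
encoded = pd.get_dummies(df, columns=["Age", "Gender"])
print(encoded.columns.tolist())
```

Each age bin becomes its own column, so a row in the 18-25 bin is flagged in Age_18-25 and nowhere else.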

There is a variable called Product category. However, a product can belong to multiple product categories, so each product can also have a second and third product category. We need to write a function that looks at these 3 product category columns and creates a single dummy indicator variable for each category.
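One possible way to write such a function is sketched below. It is not necessarily the function used in the project: the column names Product_Category_1 to Product_Category_3, the helper name category_indicators and the cat_ prefix are all hypothetical, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample; missing second/third categories are NaN (an assumption).
df = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 8],
    "Product_Category_2": [np.nan, 6, np.nan, 14],
    "Product_Category_3": [np.nan, 14, np.nan, np.nan],
})
cols = ["Product_Category_1", "Product_Category_2", "Product_Category_3"]


def category_indicators(frame, cols):
    """Return one 0/1 column per category, set if any of `cols` holds it."""
    # Collect every category id seen in any column, skipping missing values.
    categories = sorted({int(v) for c in cols for v in frame[c].dropna()})
    out = pd.DataFrame(index=frame.index)
    for cat in categories:
        # 1 if the product carries this category in any of the three columns.
        out[f"cat_{cat}"] = frame[cols].eq(cat).any(axis=1).astype(int)
    return out


indicators = category_indicators(df, cols)
print(indicators)
```

With this layout a product listed under categories 1, 6 and 14 gets a 1 in three columns, instead of three separate sets of dummies that would treat "category 14 as second" and "category 14 as third" as different things.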

Basic Model:

Initially for this problem I used a basic RandomForestRegressor with default parameters just to see how everything shapes up. I split the training data using the train_test_split function so that the model could be evaluated on data it had not seen during fitting.
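A rough sketch of this baseline is shown below. It uses synthetic data in place of the real training set, and the 80/20 split and random seeds are assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded training data (assumption, not the real set).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 8)).astype(float)  # 0/1 indicator features
y = X @ rng.uniform(1000, 5000, size=8) + rng.normal(0, 500, size=500)

# Hold out 20% of the rows to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default hyperparameters; random_state only makes the run repeatable.
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Root mean squared error on the held-out rows.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Hold-out RMSE: {rmse:.2f}")
```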

Tuning Hyperparameters:

Another useful model for this scenario is the XGBoost regressor. I ran it with default parameters and observed better results than with the Random Forest model. We can use GridSearchCV, which does an exhaustive search over a grid of hyperparameter values and identifies the combination that gives the best cross-validated result. An important point to note is that this grid search can be quite resource intensive and time consuming. If you don’t have a lot of time you can reduce the number of candidate hyperparameter values to make your search faster. You can also use the verbose parameter of GridSearchCV to see the progress of the search.
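The search can be sketched as follows. To keep the example dependent on scikit-learn only, GradientBoostingRegressor stands in for the XGBoost regressor here; the data is synthetic, and the grid values, cv=3 and the scoring choice are assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the real training set).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8)).astype(float)
y = X @ rng.uniform(1000, 5000, size=8) + rng.normal(0, 500, size=300)

# A deliberately small grid: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.1, 0.3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for the XGBoost regressor
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, so RMSE is negated
    cv=3,        # 8 combinations x 3 folds = 24 model fits
    verbose=1,   # print progress while the search runs
)
search.fit(X, y)

print(search.best_params_)
print(f"Cross-validated RMSE: {-search.best_score_:.2f}")
```

The number of fits grows multiplicatively with every value added to the grid, which is why trimming the grid is the quickest way to shorten the search.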

The best_params_ attribute tells which hyperparameters gave the best results, and the best_estimator_ attribute holds the model refit with those hyperparameters. This estimator is used to create our final prediction. An RMSE of 2813.67 was achieved using this model.
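A short sketch of how the fitted search might be used for the final prediction is given below. The data is synthetic, GradientBoostingRegressor is again a scikit-learn stand-in for the XGBoost regressor, and the one-parameter grid is an assumption, so the RMSE it prints is unrelated to the score reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption, not the real training set).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 6)).astype(float)
y = X @ rng.uniform(1000, 5000, size=6) + rng.normal(0, 500, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for the XGBoost regressor
    {"max_depth": [2, 4]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# best_params_ lists the winning values; best_estimator_ is the model
# refit on all of X_train with those values (refit=True is the default).
print(search.best_params_)
final_model = search.best_estimator_

predictions = final_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"Hold-out RMSE: {rmse:.2f}")
```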
