Description
Using the recent Home Depot Search Relevance Kaggle competition (https://www.kaggle.com/c/home-depot-product-search-relevance) as an example, this talk will give a general overview of a predictive analytics project in Python, focusing specifically on text analytics. I will first introduce Kaggle and this competition in particular. As an introduction to predictive analytics, I will then show how to start designing "features" that describe the similarity of two samples of text. Using a very simple feature, the number of overlapping words, we can build our first simple predictive model. Building on this base, we'll then explore more advanced text mining techniques, such as term frequency matrices, TF-IDF, Latent Semantic Analysis, and word2vec. Finally, we'll see how the final random forest model performs and which features were most important.
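To make the "number of overlapping words" feature concrete, here is a minimal sketch of how it could be computed. The function names, the tokenization rule, and the example strings are assumptions made for illustration; they are not taken from the talk or from the competition data.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap(query, document):
    """Count the distinct query words that also appear in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))


# Hypothetical search term and product title, invented for illustration.
query = "angle bracket"
title = "Steel Corner Angle 2 in."
print(word_overlap(query, title))
```

A count like this could serve as one column of a feature table, with the more advanced similarity measures mentioned above added as further columns.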
I have already delivered this presentation in 30 minutes at a work event. I can adjust the talk to spend less time on the introduction to predictive analytics and more on the text analytics methods if that would be a better fit for the audience.