Description
Data clustering is a powerful tool for data analysis. It can be particularly useful in exploratory data analysis for helping to summarize and give intuition about a dataset. Despite it's power clustering is used for this task far less frequently than it could be. A plethora of options for clustering algorithms exist, and we will provide a survey of some of the more popular options, discussing their strengths and weaknesses, particularly with regard to exploratory data analysis. Our focus, however, is on a relatively new algorithm that appears to be the best equipped to meet the needs of exploratory data analysis: HDBSCAN* has the strengths of density based algorithms, has a small robust set of parameters, and with suitable implementation can be made highly scalable to large datasets. We will discuss how the algorithm works, taking a few different perspectives, and explain the techniques used for a high performance implementation. Finally we'll discuss ways to extend the algorithm, drawing on ideas from topological data analysis.