Description
In this talk I will describe a system that we've built for hierarchical text classification. I will walk through the logical setup of the steps involved: data processing, feature selection, training, validation and labelling. To make this work in practice, we've mapped the setup onto a Hadoop cluster. I'll discuss some of the pros and cons that we've run into when working with Python and Hadoop. Finally, I'll discuss how we use crowdsourcing to continuously improve the quality of our hierarchical classifier.