
Easy Spark: Exploiting large datasets for multi-class classification

Description

Apache Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. At first glance, getting started with the Hadoop ecosystem seems cumbersome and not very user-friendly for a data scientist or machine learning specialist. In this talk I will briefly introduce Apache Spark and its programming paradigm. I will show how to easily run distributed training of common multi-class classifiers (naïve Bayes, random forest, logistic regression) without installing a single virtual machine, VirtualBox instance, or Docker container. I will also share my experience of managing long-term software projects that rely on Hadoop technology for data storage, extraction, and transformation.
