Description
Apache Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. At first glance, getting started with programming for the Hadoop ecosystem seems cumbersome and not particularly user-friendly for a data scientist or machine learning specialist. In this talk I will briefly introduce Apache Spark and its programming paradigm. I will show how to easily run distributed training of common multi-class classifiers (naïve Bayes, random forest, logistic regression) without installing a single virtual machine, VirtualBox image, or Docker container. I will also share my experience of managing long-term software projects that rely on Hadoop technology for data storage, extraction, and transformation.
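
A minimal sketch of the kind of workflow the talk describes: training Spark MLlib multi-class classifiers in local mode, with no virtual machine or Docker container required. The dataset path, column names, and hyperparameters below are illustrative assumptions, not material from the talk.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import NaiveBayes, RandomForestClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Run Spark in local mode; the same code scales to a cluster by changing the master URL.
spark = SparkSession.builder.master("local[*]").appName("multiclass-demo").getOrCreate()

# Hypothetical CSV with numeric feature columns and a string label column.
df = spark.read.csv("data/iris.csv", header=True, inferSchema=True)
features = [c for c in df.columns if c != "species"]

# Assemble numeric columns into a feature vector and index the label.
assembler = VectorAssembler(inputCols=features, outputCol="features")
indexer = StringIndexer(inputCol="species", outputCol="label")
prepared = indexer.fit(df).transform(assembler.transform(df))
train, test = prepared.randomSplit([0.8, 0.2], seed=42)

# Train and compare the three classifiers mentioned in the abstract.
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
for clf in (NaiveBayes(), RandomForestClassifier(numTrees=50), LogisticRegression(maxIter=100)):
    model = clf.fit(train)  # distributed training handled by Spark under the hood
    accuracy = evaluator.evaluate(model.transform(test))
    print(f"{type(clf).__name__}: accuracy = {accuracy:.3f}")

spark.stop()

The same script runs unchanged against a real cluster by pointing the master URL at YARN or a standalone Spark master, which is the point the abstract makes about getting started without any local virtualization.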