Contribute Media
A thank you to everyone who makes this possible: Read More

A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons

Description

PyData Berlin 2016

The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting; which would you use in production?

The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?

At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.

In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.

Details

Improve this page