Description
This talk aims to answer a few questions:
- What do you do when you need to move your model from your laptop to production?
- Is "big data == I need to use the JVM" the right assumption?
- How can I put my Jupyter notebook in production?
- How do you apply the best software engineering practices (testing and CI, for example) inside your data science process?
- How do you “decouple” your data scientists, developers and devops teams?
- How do you guarantee the reproducibility of your models?
- How do you scale your training process when it no longer fits in memory?
- How do you serve your models and provide an easy rollback system?
The Agenda:
- The Data Science workflow
- Scaling is not just a matter of the size of your Data
- Scaling when the size of your Data matters
- DDS, Dockerized Data Science
- Cassiny
I’ll share my experience, highlighting some of the challenges I faced and the solutions I came up with to answer these questions.
During this presentation I will mention libraries like Jupyter, Atom, scikit-learn, Dask, Ray, Parquet, Arrow and many others.
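As a taste of the "data no longer fits in memory" part of the talk, here is a minimal sketch of how Dask can process a Parquet dataset out of core; the file path and column names (`events/*.parquet`, `user_id`, `amount`) are hypothetical placeholders, not something from an actual project:

```python
import dask.dataframe as dd

# Read a Parquet dataset lazily; Dask splits it into partitions
# that are processed out of core instead of loading everything into RAM.
df = dd.read_parquet("events/*.parquet")  # hypothetical dataset path

# The aggregation is computed partition by partition and then combined,
# so the full dataset never has to fit in memory at once.
mean_amount_per_user = df.groupby("user_id")["amount"].mean().compute()
print(mean_amount_per_user.head())
```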
The principles and best practices I will share are something you can apply, more or less easily, if you are running, or are about to run, a production system based on the Python stack.
This talk will focus on (my) best practices for running the Python data stack in production. I will also talk about Cassiny, an open source project I started that aims to simplify your life if you want a completely Python-based solution for your data science workflow.
On Friday 20 April at 11:00