Description
This talk aims to answer a few questions:
- What do you do when you need to move your model from your laptop to production?
- Is "big data == I need to use the JVM" the right assumption?
- How can I put my Jupyter notebook in production?
- How do you apply the best software engineering practices (testing and CI, for example) inside your data science process?
- How do you “decouple” your data scientists, developers and devops teams?
- How do you guarantee the reproducibility of your models?
- How do you scale your training process when it no longer fits in memory?
- How do you serve your models and provide an easy rollback system?
The Agenda:
- The Data Science workflow
- Scaling is not just a matter of the size of your Data
- Scaling when the size of your Data matters
- DDS, Dockerized Data Science
- Cassiny
I’ll share my experience, highlighting some of the challenges I faced and the solutions I came up with to answer these questions.
During this presentation I will mention libraries like Jupyter, Atom, scikit-learn, Dask, Ray, Parquet, Arrow and many others.
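As a taste of the "data no longer fits in memory" part of the talk, here is a minimal sketch of how Dask can process a Parquet dataset out of core; the file path and column names (`events/*.parquet`, `user_id`, `amount`) are hypothetical placeholders, not something from an actual project:

```python
import dask.dataframe as dd

# Read a Parquet dataset lazily; Dask splits it into partitions
# that are processed out of core instead of loading everything into RAM.
df = dd.read_parquet("events/*.parquet")  # hypothetical dataset path

# The aggregation is computed partition by partition and then combined,
# so the full dataset never has to fit in memory at once.
mean_amount_per_user = df.groupby("user_id")["amount"].mean().compute()
print(mean_amount_per_user.head())
```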
The principles and best practices I will share are something you can apply, more or less easily, if you are running, or are about to run, a production system based on the Python stack.
This talk will focus on (my) best practices for running the Python data stack in production. I will also talk about Cassiny, an open source project I started that aims to simplify your life if you want a completely Python-based solution for your data science workflow.
On Friday 20 April at 11:00