Collaboration Infrastructure in Data Science: Tools, Challenges, and Best Practices

Description

The PyData ecosystem has mature collaboration tools: JupyterHub for shared infrastructure, conda for package and environment management, and Dask for distributed computing. However, setting up and using a platform that combines them requires in-depth knowledge of each tool. This talk discusses some user-friendly solutions for collaborative practices such as:

- Sharing ongoing work, visualizations, and dashboards with reproducible environments
- Designing for scalability (distributed compute) and productionization
- Monitoring and managing team resources to minimize cloud costs

Prerequisites: A basic understanding of Python-based data science tools (NumPy, pandas, matplotlib, etc.) and workflows (exploratory analysis, visualization, etc.) is required. If you have used Jupyter Notebooks, created environments with the conda package manager, and performed a groupby operation in pandas, you should be able to follow the talk comfortably. While not necessary, hands-on experience with data workflows, previous experience working in a team, and familiarity with distributed computing principles will help you get the most value out of this talk.
