Frictionless Data, Frictionless Development

Description

A common problem in data engineering is how to build a platform that can both import and export tabular data in numerous formats and maintain a change history of the data while users update and query it.

Tools like Google Cloud Dataprep by Trifacta provide a turnkey solution to part of the pipeline, but the open-source Frictionless Data tools from OKFN can provide a simpler subset of these features tailored to your requirements.

Just as pandas is built around the DataFrame, the Frictionless Data approach is built around data packages, each consisting of a JSON Table Schema and a data URI. These schemata can be easily generated for any dataset and work well for a number of applications, such as:

Validating new data with tools like Goodtables or tableschema-py
Building a data update interface with tools such as Handsontable JS
Creating declarative data processing pipelines that a front end can easily interact with, via datapackage-pipelines and Kubernetes
Pushing data into various databases and repository tools such as CKAN datastore
Extending the schema to allow export to linked data formats such as IIIF
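
To make the idea of a data package concrete, here is a minimal sketch using only the Python standard library. It is not the API of goodtables or tableschema-py. The sample data, the file name scores.csv, the package name, and the helper functions infer_type, infer_schema and validate_rows are all hypothetical, and the type inference is a naive heuristic that only distinguishes integer, number and string.

```python
import csv
import io
import json

# Made-up sample data; in a real package this would live at the resource's path or URL.
SAMPLE_CSV = "id,name,score\n1,alice,3.5\n2,bob,4.0\n"

# Map a few Table Schema type names to Python casts (illustrative subset only).
CASTS = {"integer": int, "number": float, "string": str}


def infer_type(values):
    """Guess a Table Schema type name for a column of strings (naive heuristic)."""
    if not values:
        return "string"
    for type_name in ("integer", "number"):
        try:
            for value in values:
                CASTS[type_name](value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(rows):
    """Build a Table Schema style dict from a header row plus data rows."""
    header, *body = rows
    columns = list(zip(*body)) if body else [()] * len(header)
    return {
        "fields": [
            {"name": name, "type": infer_type(column)}
            for name, column in zip(header, columns)
        ]
    }


def validate_rows(rows, schema):
    """Return (row_number, field_name, value) for every cell that fails to cast."""
    _header, *body = rows
    errors = []
    for row_number, row in enumerate(body, start=2):  # row 1 is the header
        for field, value in zip(schema["fields"], row):
            try:
                CASTS[field["type"]](value)
            except ValueError:
                errors.append((row_number, field["name"], value))
    return errors


rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))

# A data package descriptor: a schema plus a pointer to where the data lives.
descriptor = {
    "name": "example-package",
    "resources": [
        {"name": "scores", "path": "scores.csv", "schema": infer_schema(rows)}
    ],
}
print(json.dumps(descriptor, indent=2))
```

Because the descriptor is plain JSON, the same document can drive validation, an editing front end, or a database loader, which is the point the list above makes.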

The talk will cover these use cases and compare them with the approaches taken by other open-source data science / BI tools, such as Datashape with Odo from Continuum and Superset from Airbnb. I will aim to demonstrate that lightweight web standards like data packages speed up the development process.
