Description
A common problem in data engineering is how to build a platform that can import and export tabular data in numerous formats while also maintaining a change history of the data as users update and query it.
Tools like Google Cloud Dataprep by Trifacta provide a turnkey solution to part of the pipeline, but the open source Frictionless Data tools from OKFN can provide a simpler subset of these features tailored to your requirements.
Just as Pandas is built around the DataFrame, the Frictionless Data approach is built around data packages, each consisting of a JSON Table Schema and a data URI. These schemata can be easily generated for any dataset and work well for a number of applications such as:
- Validating new data with tools like Goodtables or tableschema-py
- Building a data update interface with tools such as Handsontable JS
- Creating declarative data processing pipelines that a front end can easily interact with, via datapackage-pipelines and Kubernetes
- Pushing data into various databases and repository tools such as CKAN datastore
- Extending the schema to allow export to linked data formats such as IIIF
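As a rough illustration of the data package idea, the sketch below infers a Table Schema from a CSV sample, wraps it in a data package descriptor, and then checks new rows against that schema. It uses only the Python standard library and does not reproduce the tableschema-py or Goodtables APIs. The helper names (`infer_schema`, `build_package`, `validate_rows`), the sample data, and the `data/fruit.csv` path are all hypothetical, and type inference is limited to `integer`, `number`, and `string`.

```python
import csv
import io
import json

# Casting functions for the three Table Schema types this sketch handles.
CASTS = {"integer": int, "number": float, "string": str}


def infer_type(values):
    """Return the narrowest of integer/number/string that fits every value."""
    if not values:
        return "string"
    for type_name in ("integer", "number"):
        try:
            for value in values:
                CASTS[type_name](value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Build a minimal Table Schema descriptor from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = [
        {"name": name, "type": infer_type([row[name] for row in rows])}
        for name in (reader.fieldnames or [])
    ]
    return {"fields": fields}


def build_package(name, path, csv_text):
    """Wrap the inferred schema and the data location in a package descriptor."""
    resource = {"name": name, "path": path, "format": "csv",
                "schema": infer_schema(csv_text)}
    return {"name": name, "resources": [resource]}


def validate_rows(csv_text, schema):
    """Return (row_number, field_name, value) for every value that fails to cast."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for field in schema["fields"]:
            value = row.get(field["name"])
            try:
                CASTS[field["type"]](value)
            except (TypeError, ValueError):
                errors.append((row_number, field["name"], value))
    return errors


if __name__ == "__main__":
    # Made-up sample data, used only to exercise the helpers above.
    sample = "id,name,price\n1,apple,0.5\n2,pear,1.25\n"
    package = build_package("fruit", "data/fruit.csv", sample)
    print(json.dumps(package, indent=2))
    print(validate_rows("id,name,price\n3,plum,cheap\n",
                        package["resources"][0]["schema"]))
```

Because the descriptor is plain JSON, the same file can in principle be read by a validation step, a data-entry front end, and a database loader, which is the property the use cases above rely on.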
The talk will cover these use cases and compare them with the approaches taken by other open-source data science / BI tools such as Datashape with odo from Continuum and Superset from Airbnb. I will aim to demonstrate that lightweight web standards like data packages speed up the development process.