Description
On the first weekend of October 2015, my company bought an advert during the first episode of "Downton Abbey" on Sunday evening. It was so successful that the website went down for half an hour. We wanted to look at the analytics and the data to estimate the impact, but those were having a very hard weekend too: the replica of the production database we used was unreachable, and the only person who knew how to fix it was on a plane. Monday really was a memorable day.
This session shares real-life experience and best practices in Data Engineering. Attendees should have some prior understanding of data and technology in business. They should leave with an understanding of on-call practices and some quick-win actions to take.
What does it mean to be on call?
How do you make sure that the phone rings as little as possible?
- Fixing versus Root Cause Analysis.
- Systems break at junctures.
- Especially if the juncture is with a third party.
Why and when is it worth reacting to errors as soon as they happen?
- External Services.
- Increasing Business Trust.
- Allowing others to build on solid ground.
How do you make sure the phone rings when it should?
- Alerting tools: emails, chat, specialised applications like PagerDuty, OpsGenie and Twilio.
- Monitoring systems.
- Monitoring data (Data Quality) as a continuous early warning system.
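To make the last point concrete, here is a minimal sketch of a data quality check used as an early warning. Everything in it is an illustrative assumption rather than something from the session itself: the `TableStats` snapshot, the `quality_warnings` function, the `user id` column and the threshold values are all hypothetical, and a real setup would collect the figures with a scheduled query and route the warnings to one of the alerting tools listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TableStats:
    """Hypothetical snapshot of one table, e.g. gathered by a scheduled query."""
    row_count: int
    null_user_ids: int
    latest_event: datetime


def quality_warnings(
    stats: TableStats,
    now: datetime,
    min_rows: int = 1000,                            # assumed threshold
    max_null_ratio: float = 0.01,                    # assumed threshold
    max_staleness: timedelta = timedelta(hours=2),   # assumed threshold
) -> list:
    """Return human-readable warnings; an empty list means nothing looks wrong."""
    warnings = []
    # Volume check: a sudden drop in rows often means an upstream load failed.
    if stats.row_count < min_rows:
        warnings.append(f"row count {stats.row_count} is below {min_rows}")
    # Completeness check: too many missing keys makes the data unusable downstream.
    if stats.row_count > 0 and stats.null_user_ids / stats.row_count > max_null_ratio:
        warnings.append("too many rows without a user id")
    # Freshness check: stale data is the first sign of an unreachable source.
    if now - stats.latest_event > max_staleness:
        warnings.append("latest event is stale")
    return warnings


if __name__ == "__main__":
    monday_morning = datetime(2015, 10, 5, 9, 0)
    stats = TableStats(row_count=5000, null_user_ids=10,
                       latest_event=datetime(2015, 10, 4, 20, 0))
    for warning in quality_warnings(stats, monday_morning):
        print("WARNING:", warning)
```

Run on a schedule, a check like this could flag a stale replica hours before anyone opens a dashboard, which is the point of treating data quality as continuous monitoring rather than a one-off audit.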