Description
Being right all the time isn't necessarily the best idea. This talk examines how to count distinct items from a firehose of data, how to determine if we've seen a given item before, and why absolute accuracy may be impractical when doing so.
Probabilistic data structures trade accuracy for approximate results, speed and economy of resources. They provide fast, scalable solutions to problems such as counting likes on social media posts, or determining which articles on a website a user has previously read.
I'll introduce the Hyperloglog and Bloom Filter, explain how they work at a high level, and demonstrate different ways in which each can be leveraged in Python.
A GitHub repo to accompany this talk can be found at https://github.com/simonprickett/python-probabilistic-data-structures
Slides: https://simonprickett.dev/no_maybe_and_close_enough_slides.pdf