Description
Finding documents that are almost, but not exactly, duplicates is difficult. Exact hashing does not work, because changing a single character produces a completely different hash, and pairwise comparison does not scale, because the number of comparisons grows quadratically with the number of documents. Locality-Sensitive Hashing (LSH) is a scalable method for detecting near-duplicate content that lets you trade computation against accuracy. I will present the theoretical side of LSH and an open-source Python implementation of the technique.
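To give a flavour of the idea, here is a minimal sketch of one common LSH scheme, MinHash signatures with banding, assuming documents are compared by the Jaccard similarity of their character shingles. It is not the implementation presented in the talk; the function names (`shingles`, `minhash_signature`, `candidate_pairs`) and the default parameters are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of overlapping k-character substrings of text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over all tokens."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(docs, num_hashes=64, bands=16):
    """Return pairs of document ids whose signatures agree on at least one band."""
    rows = num_hashes // bands  # signature rows per band
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Documents with an identical band land in the same bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumped over the lazy dog",
        "c": "an entirely unrelated sentence about hashing",
    }
    print(candidate_pairs(docs))
```

Only documents that share a bucket are compared further, which avoids the full pairwise scan. In this sketch the trade between computation and accuracy is controlled by `num_hashes` and `bands`: more hash functions give a better similarity estimate at a higher cost, and more bands with fewer rows each catch more near duplicates at the price of more false candidates.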