Data Deduplication using Locality Sensitive Hashing

YouTube

Description

Finding documents that are almost exact duplicates is difficult. Exact hashing algorithms do not work and pairwise comparisons do not scale. Locality Sensitive Hashing (LSH) is a scalable method for detecting near duplicate content that allows computation to be exchanged for accuracy. I will present the theoretical side of LSH and an open source Python implementation of the technique.

PyVideo

Data Deduplication using Locality Sensitive Hashing

Description

Details