I worked with a client that implemented their own Python version of this to deduplicate citizen entries in a big French government database. It worked well.
Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/). On this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic, e.g. https://pypi.org/project/rensa/, which is basically a more specialized but faster version of datasketch's MinHash, written in Rust with a little Python on top.
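For anyone who hasn't used it, here's a minimal sketch of the datasketch MinHash + LSH workflow for near-duplicate detection; the example records and the 0.5 threshold are made up for illustration:

```python
from datasketch import MinHash, MinHashLSH

def minhash(tokens, num_perm=128):
    # One MinHash signature per record, built from its token set
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

records = {
    "r1": "jean dupont 1980-01-01 paris",
    "r2": "jean dupond 1980-01-01 paris",   # near-duplicate of r1
    "r3": "marie curie 1867-11-07 warsaw",
}

# LSH index that returns candidates with estimated Jaccard >= threshold
lsh = MinHashLSH(threshold=0.5, num_perm=128)
sigs = {key: minhash(text.split()) for key, text in records.items()}
for key, sig in sigs.items():
    lsh.insert(key, sig)

print(lsh.query(sigs["r1"]))   # likely ['r1', 'r2']
```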
The Fellegi-Sunter model is able to estimate the importance of different types of information from the data itself (i.e. unsupervised learning).
For instance, a match on a date of birth column lends a greater weight of evidence in favour of a match than a match on first name (since dob has higher cardinality). The method is also able to estimate weights for fuzzy matches (how much evidence in favour of a match a close match on dob with one character difference is), and also how much evidence against a match a mismatch is. For instance, if you have very high data quality on gender, then a match on gender doesn't tell you much, but a mismatch on gender is quite strong evidence against the idea that two records match. I have a blog post here that delves into this a bit more: https://www.robinlinacre.com/fellegi_sunter_accuracy/
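To make the idea concrete, here's a toy sketch of Fellegi-Sunter match weights; the m/u probabilities below are invented for illustration, not estimated from data:

```python
import math

# m = P(column agrees | records are a true match)
# u = P(column agrees | records are not a match)
cols = {
    #                m      u
    "dob":        (0.95, 1 / (365 * 80)),  # high cardinality -> tiny u
    "first_name": (0.90, 0.01),
    "gender":     (0.99, 0.50),            # agreement happens half the time by chance
}

def match_weight(col, agree):
    m, u = cols[col]
    # log2(m/u) bits of evidence for a match on agreement,
    # log2((1-m)/(1-u)) bits (usually negative) on disagreement
    return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))

for col in cols:
    print(col, round(match_weight(col, True), 1), round(match_weight(col, False), 1))
# dob agreement is worth ~+14.8 bits; a gender match only ~+1.0 bit,
# but a gender mismatch is ~-5.6 bits of evidence against a match
```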
The performance is pretty good because the prediction is ultimately just adding up match weights. Much of the performance is dictated by:
(1) the blocking approach you choose (how wide you cast the net in searching for matches; this is actually somewhere where minhash can be used in conjunction), and
(2) whether you choose to use complex fuzzy matching functions, and how many - this is chosen by the user.
There are some benchmarking results here: https://www.robinlinacre.com/fast_deduplication/ Overall it's an approach which makes a pretty good trade-off between speed and accuracy. That's why it's used by many national stats institutes (UK, US, Aus, Germany etc.) - because it's capable of working on country-population sized datasets.
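A minimal sketch of the blocking idea in point (1): only record pairs that share a blocking key are ever compared, which is what keeps the candidate set far below the full N^2. The records and the choice of dob as the blocking key are made up for illustration:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "first": "robin", "dob": "1985-03-02"},
    {"id": 2, "first": "robyn", "dob": "1985-03-02"},
    {"id": 3, "first": "alice", "dob": "1990-07-14"},
]

# Group records by a blocking key (here: exact date of birth)
blocks = defaultdict(list)
for r in records:
    blocks[r["dob"]].append(r)

# Only pairs within the same block become candidates for scoring
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)   # [(1, 2)] instead of all 3 possible pairs
```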
a little backstory (that seems to be missing from the blog post). this is a technique that was invented in the early days of google for deduplicating the crawl set (turns out that building an llm and building a plain ol' web text index look awfully similar; i wonder why that is?!). you can read about it in depth in jeffrey ullman's free book "mining massive datasets" which describes many of the really cool and impressive techniques that were used to prepare an index build for the entire internet when such a thing was really hard.
you can find the relevant material for free by searching "chapter 3 pdf mmds ullman" enjoy! edit: oh no! i'm wrong! according to wikipedia it was invented at dec for altavista. https://en.wikipedia.org/wiki/MinHash either way there's a nice description in the ullman book and they do describe how it was used at google as well. |
As I got into MinHash and its variants I found it hard to wrap my head around, so I'm building an online visualizer:
https://websla.sh/tools/minhash It's not really complete; I'd like to show Jaccard similarity calculations, among other things, but this already lets you enter multiple strings and see with your own eyes what a "minhash" actually is.
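For anyone who prefers reading code to a visualizer, here is a minimal from-scratch sketch of the same idea (hash functions are simulated by salting Python's built-in hash, which is only consistent within a single process, but that's enough to compare two signatures in the same run):

```python
import random

random.seed(0)
K = 128
SALTS = [random.getrandbits(32) for _ in range(K)]

def signature(tokens):
    # For each of the K "hash functions", keep the minimum hash over the set
    return [min(hash((salt, t)) & 0xFFFFFFFF for t in tokens) for salt in SALTS]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of signature positions that agree estimates the true Jaccard
    return sum(x == y for x, y in zip(sig_a, sig_b)) / K

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox leaps over the lazy cat".split())
true_j = len(a & b) / len(a | b)
print(round(true_j, 2), round(estimated_jaccard(signature(a), signature(b)), 2))
```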
This is one of those techniques I don't understand when I read it as text, but instantly absorb when I have a working code example and pass some of my own data through it a few times, inspecting the process as it works.
I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and was puzzled because I figured that less random optimization would work better.
Hashing or tiny neural nets combined with a vector search engine using Tanimoto/Jaccard is a very common deduplication strategy for large datasets. It might be wiser than using linear-complexity MapReduce operations.
There is a nice Google project using a 0.5M-parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
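A tiny sketch of the metric itself on binary fingerprints (the bit vectors below are random stand-ins for whatever hash or binarised embedding you would actually index in an engine like USearch):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 256, dtype=np.uint8)   # 256-bit fingerprint
b = a.copy()
b[:32] ^= 1                                   # flip 32 bits to make a near-duplicate

def tanimoto(x, y):
    # Tanimoto/Jaccard on binary vectors: |intersection| / |union| of set bits
    inter = np.count_nonzero(x & y)
    union = np.count_nonzero(x | y)
    return inter / union

print(round(tanimoto(a, b), 2))   # still high despite the flipped bits
```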
You can use cosine similarity with embedding vectors of different lengths (or better: the vectors all have the same length, but they are sparse, with most components being 0).
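A short sketch of that sparse setup, assuming scikit-learn (the example strings are invented): each document becomes a TF-IDF vector of the same dimensionality (the vocabulary size) with most components zero, and cosine similarity works directly on the sparse matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "acme corporation ltd london",
    "acme corp limited london",
    "globex industries new york",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse matrix, one row per document
print(cosine_similarity(X).round(2))        # pairwise cosine similarities
```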
You can use MinHash to avoid the full O(N^2) distance matrix, but with just 600,000 items you might just brute-force compute the full matrix for simplicity. What's your time budget?
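Back-of-envelope for that trade-off (the comparison rates below are assumptions, not measurements):

```python
n = 600_000
pairs = n * (n - 1) // 2                      # ~1.8e11 unique pairs
for rate in (1_000_000, 100_000_000):         # assumed comparisons per second
    hours = pairs / rate / 3600
    print(f"{pairs:,} pairs -> ~{hours:,.1f} h at {rate:,} comparisons/s")
# ~50 h at 1M/s, ~0.5 h at a vectorised 100M/s -- hence "what's your time budget?"
```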
If you think two items cover the same highly similar _keywords_ then Jaccard distance should work.
If you think two items share highly similar _text_ then give Levenshtein distance a try.
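A quick sketch comparing the two on a toy pair of strings, with a hand-rolled edit distance so it stays dependency-free:

```python
def jaccard(a_tokens, b_tokens):
    # Set overlap of keywords
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance over characters
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

s1, s2 = "fc barcelona women", "futbol club barcelona femeni"
print(jaccard(s1.split(), s2.split()))   # keyword overlap of the token sets
print(levenshtein(s1, s2))               # character edits between the raw strings
```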
Do the minhash-based algorithms in the article give exactly the same results (but faster) as pairwise calculation of J(a, b), or are the minhash-based results approximate?
I had to do this with sports teams, but I used Levenshtein for the names. I ended up creating a vector with other data (country, gender) and using that vector to calculate the distance (similarity).
The only complication is that you then need to choose a suitable T-Norm / T-Conorm pair, which express the concept of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.
I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.
Otherwise what most people do instead is to simply threshold at 0.5 to obtain binary sets for use with the binary variants of the Jaccard/Dice coefficients, which apparently decreases the precision of your validation operator by two orders of magnitude. It's like, you publish your algorithm claiming it's better than SOTA by 0.001, ignoring the fact that your validation operator has an error margin of 0.1... (there's a small sketch of the fuzzy version after the references)
[0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...
[1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
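A minimal sketch of the fuzzy Jaccard using the (min, max) T-norm/T-conorm pair as intersection and union over membership values in [0, 1]; the probabilistic masks are invented for illustration:

```python
import numpy as np

def fuzzy_jaccard(a, b):
    inter = np.minimum(a, b).sum()   # T-norm   -> fuzzy intersection
    union = np.maximum(a, b).sum()   # T-conorm -> fuzzy union
    return inter / union

pred  = np.array([0.9, 0.7, 0.2, 0.05])   # probabilistic segmentation
truth = np.array([1.0, 0.6, 0.3, 0.0])    # fuzzy ground truth
print(round(fuzzy_jaccard(pred, truth), 3))

# Thresholding both at 0.5 first discards the soft information:
print(round(fuzzy_jaccard((pred > 0.5).astype(float), (truth > 0.5).astype(float)), 3))
```

With binary inputs this reduces to the ordinary Jaccard index, which is what makes the fuzzy version a drop-in generalisation.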