(comments)

Original link: https://news.ycombinator.com/item?id=40872438

This article discusses the use of set-based metrics, in particular the Jaccard similarity (Tanimoto coefficient) and the F1 score (Dice coefficient), beyond their traditional use on binary data. These metrics can also be used with fuzzy sets, with intersection and union of non-binary data handled through a suitable T-Norm/T-Conorm pair. The author highlights the advantages of this approach for validating medical image segmentations, compared with thresholding at 0.5 to reduce the data to binary sets. The discussion further illustrates an implementation through the problem of deduplicating news items with a minhash-like system in PostgreSQL. By concatenating and splitting text fields into ngrams, creating random arrays, and performing minhashing, the approach aims to reduce a large dataset to a smaller size while preserving similarity information. Ultimately, the user intends to determine how likely two rows are to be similar based on how often they fall into the same or nearly the same minhash buckets, and to compute Jaccard or cosine similarity accordingly.


Original text


One thing many people miss about set-based metrics like the Jaccard similarity (aka Tanimoto coefficient) and the F1 score (aka Dice coefficient) is that they can also be used identically with fuzzy sets.

The only complication is that you then need to choose a suitable T-Norm / T-Conorm pair, which express the concept of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.

I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.

Otherwise, what most people do instead is simply threshold at 0.5 to obtain binary sets for use with the binary variants of the Jaccard / Dice coefficients, which apparently decreases the precision of your validation operator by 2 orders of magnitude. It's like, you publish your algorithm claiming it's better than SOTA by 0.001, ignoring the fact that your validation operator has an error margin of 0.1 ...

[0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...

[1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
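
A minimal sketch of the fuzzy variant, using the standard min/max pair as the T-Norm/T-Conorm (the papers above discuss other choices; the example values here are made up):

    import numpy as np

    def fuzzy_jaccard(a, b):
        # a, b: membership values in [0, 1], e.g. per-voxel probabilities.
        # T-norm = elementwise min (intersection), T-conorm = elementwise max (union).
        return np.minimum(a, b).sum() / np.maximum(a, b).sum()

    def fuzzy_dice(a, b):
        return 2 * np.minimum(a, b).sum() / (a.sum() + b.sum())

    seg   = np.array([0.9, 0.2, 0.7, 0.0])   # fuzzy segmentation
    truth = np.array([1.0, 0.1, 0.5, 0.0])   # fuzzy ground truth
    print(fuzzy_jaccard(seg, truth), fuzzy_dice(seg, truth))

Both reduce to the usual binary Jaccard/Dice when the memberships are all 0 or 1.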



I worked with a client that implemented their own Python version of this to deduplicate citizen entries in a big French gov database. It worked well.

Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/).

With this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic, e.g.:

https://pypi.org/project/rensa/

It's basically a more specialized but faster version of datasketch's MinHash, written in Rust, with a little Python on top.



The Fellegi-Sunter model is able to estimate the importance of different types of information from the data itself (i.e. unsupervised learning).

For instance, a match on a date of birth column lends a greater weight of evidence in favour of a match than a match on first name (since dob has higher cardinality).

The method is also able to estimate weights for fuzzy matches (how much evidence in favour of a match is a close match on dob with one character difference), and also how much evidence against a match a mismatch is.

For instance, if you have very high data quality on gender, then a match on gender doesn't tell you much, but a mismatch on gender is quite strong evidence against the idea two records match.

I have a blog post here that delves into this a bit more: https://www.robinlinacre.com/fellegi_sunter_accuracy/
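
To make that concrete, a rough sketch of the scoring with invented m/u probabilities (m = P(fields agree | true match), u = P(fields agree | non-match); the match weight is the log2 Bayes factor):

    import math

    # Hypothetical parameters, not estimated from real data.
    params = {
        "dob":        {"m": 0.95, "u": 0.01},  # high cardinality -> low u -> strong evidence
        "first_name": {"m": 0.90, "u": 0.05},
        "gender":     {"m": 0.99, "u": 0.50},  # agreement is weak evidence here
    }

    def match_weight(col, agrees):
        m, u = params[col]["m"], params[col]["u"]
        if agrees:
            return math.log2(m / u)             # evidence in favour of a match
        return math.log2((1 - m) / (1 - u))     # evidence against a match

    # A record pair agreeing on dob and gender but not on first name:
    comparisons = [("dob", True), ("first_name", False), ("gender", True)]
    print(sum(match_weight(c, a) for c, a in comparisons))  # threshold this total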



It doesn't sound like the approaches are incompatible. You can use minhash LSH to search a large set and get a top-k list for any individual, then use a weighted average with penalty rules to decide which of those qualifies as a dupe or not. Weighted minhash can also be used to efficiently add repeats to give some additional weighting.



The performance is pretty good because the prediction is ultimately just adding up match weights. Much of the performance is dictated by

(1) The blocking approach you choose (how wide you cast the net in searching for matches - this is actually somewhere where minhash can be used in conjunction).

(2) Whether you choose to use complex fuzzy matching functions, and how many - this is chosen by the user.

There's some benchmarking results here: https://www.robinlinacre.com/fast_deduplication/

Overall it's an approach which makes a pretty good trade-off between speed and accuracy. That's why it's used by many national stats institutes (UK, US, Aus, Germany etc.) - because it's capable of working on country-population-sized datasets.



Holy synchronicity Batman! I just implemented a minhash system that some may find interesting.

The problem is to find (pseudo) inverses of many different proper submatrices of a big square matrix.

Certain matrix identities (Woodbury, Banachiewicz) allow you to update the inverse of a “close” submatrix to cheaply calculate a new one.

So you store the inverses you’ve calculated already, with their row/column indices as keys. Then for each new submatrix you want to find a close already-computed inverse to update from.

This is solved with minhashing. You minwise hash the indices so that close matrices are likely to have the same hash.

In my particular implementation I did a “multi-resolution” hash so that I can change the selectiveness of the search as the number of previously computed inverses grows.
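
A rough sketch of the lookup idea (the details here are illustrative rather than the actual implementation, and the Woodbury/Banachiewicz update itself isn't shown): minwise-hash the row/column index sets so that submatrices sharing most indices tend to collide, then pick the collision with the largest overlap.

    import random

    K = 8                                       # number of independent hash functions
    SEEDS = [random.randrange(2**32) for _ in range(K)]

    def signature(indices):
        # For each seed, keep the minimum hash over all indices; similar index
        # sets are likely to agree in many signature positions.
        return tuple(min(hash((s, i)) for i in indices) for s in SEEDS)

    cache = {}                                  # (position, hash value) -> stored inverses

    def store(indices, inverse):
        for pos, h in enumerate(signature(indices)):
            cache.setdefault((pos, h), []).append((frozenset(indices), inverse))

    def closest_stored(indices):
        # Candidates collide with the query in at least one signature position;
        # rank them by how many rows/columns they actually share.
        candidates = {}
        for pos, h in enumerate(signature(indices)):
            for stored, inv in cache.get((pos, h), []):
                candidates[stored] = inv
        if not candidates:
            return None
        best = max(candidates, key=lambda s: len(s & set(indices)))
        return candidates[best]                 # update this inverse cheaply from here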



a little backstory (that seems to be missing from the blog post). this is a technique that was invented in the early days of google for deduplicating the crawl set (turns out that building an llm and building a plain ol' web text index look awfully similar; i wonder why that is?!). you can read about it in depth in jeffrey ullman's free book "mining massive datasets" which describes many of the really cool and impressive techniques that were used to prepare an index build for the entire internet when such a thing was really hard.

you can find the relevant material for free by searching "chapter 3 pdf mmds ullman"

enjoy!

edit: oh no! i'm wrong! according to wikipedia it was invented at dec for altavista. https://en.wikipedia.org/wiki/MinHash either way there's a nice description in the ullman book and they do describe how it was used at google as well.



As I got into Minhash & its variants I found it hard to wrap my head around this, so I'm building an online visualizer

https://websla.sh/tools/minhash

It's not really complete, I'd like to show Jaccard Similarity calculations among other things, but this already allows you to enter multiple strings and see with your own eyes what a "minhash" actually is.






This is one of those techniques I don't understand when I read it as text, but will instantly absorb when I have a working code example and pass some of my own data through it a few times, inspecting the working process.

I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and was puzzled because I figured that less random optimization would work better.



i think the key insight, at least for me, is that if you break things into piles of lots of little pieces and then you come up with n different ways of ordering those piles, similar things are going to have the same piece show up on top across those orderings.

add in the banding stuff and some simple probability and voila! a cheap and embarrassingly parallel way to approximate jaccard similarity across huge datasets.
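
a tiny sketch of that insight, with seeded hash functions standing in for the random orderings (no banding here, just raw signature agreement, which converges to the jaccard similarity):

    def shingles(text, k=3):
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def minhash_signature(s, seeds):
        # one "ordering" per seed; keep whichever shingle comes out on top under it
        return [min(s, key=lambda x: hash((seed, x))) for seed in seeds]

    def estimated_jaccard(a, b, n=256):
        sig_a = minhash_signature(a, range(n))
        sig_b = minhash_signature(b, range(n))
        return sum(x == y for x, y in zip(sig_a, sig_b)) / n

    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox jumped over a lazy dog")
    print(estimated_jaccard(a, b))   # close to len(a & b) / len(a | b)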



Hashing or tiny neural nets combined with a Vector Search engine with Tanimoto/Jaccard is a very common deduplication strategy for large datasets. It might be wiser than using linear-complexity MapReduce operations.

There is a nice Google project using 0.5 M parameter RETSim model and the USearch engine for that: https://github.com/google/unisim



I have this problem right now in postgres. I have 600000 feed_items with the schema (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer) The content and summary columns especially for some of the news items are highly similar but not equal. For any given 2 such news items, I am trying to cut them down to 1. Any ideas?
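
Not a pure-Postgres answer, but one option is to pull the rows out and run them through datasketch (mentioned upthread). A rough sketch, assuming title + content are what you want to compare and an arbitrary 0.8 threshold:

    from datasketch import MinHash, MinHashLSH

    def ngrams(text, n=4):
        text = " ".join(text.lower().split())
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def signature(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for g in ngrams(text):
            m.update(g.encode("utf8"))
        return m

    # rows: iterable of (feed_item_id, title, content) fetched from Postgres
    def near_duplicate_pairs(rows, threshold=0.8):
        lsh = MinHashLSH(threshold=threshold, num_perm=128)
        sigs = {}
        for item_id, title, content in rows:
            m = signature(f"{title} {content}")
            for other_id in lsh.query(m):       # already-indexed candidates above threshold
                yield item_id, other_id, m.jaccard(sigs[other_id])
            key = str(item_id)
            lsh.insert(key, m)
            sigs[key] = m

You'd then keep one feed_item_id per pair (or per connected group of pairs) and delete or merge the rest.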



I've implemented a minhash-like system in BigQuery, and was able to calculate cosine similarities (within a certain similarity threshold) across all of Stack Overflow in reasonable time. To give you a broad idea of the procedure:

1. Concatenate and split all text fields into an array of ngrams (2...n chars)

2. Declare two global arrays A & B of length-n and fill them with random 32-64 bit integers

3. Hash each ngram to a 32-64 bit integer, multiply the resulting hash by each random value in array A, and modulo the resulting output with each random value in array B. Take the minimum value. Per row, your aim is to end up with an array of 'minhashed' integers of equal length to the arrays in step 2. If you declare the global arrays to be length of 64, the minhashed array for each row will also be of length 64.

4. Bucketise the resulting hash arrays by summing N consecutive minhash values using a window function (e.g. sum each 4 consecutive rows)

If all went well, you can now unnest these arrays (call them 'source rows') and self join the dataset on each bucketed minhash value (resulting in an additional column of 'target rows'). Group by source, target columns and count the occurrences to get an indication of how likely two rows are similar.

In essence, the more often two items hash to a similar bucket, the more similar they are, and it's up to you to define the cut-off from where to calculate actual pairwise Jaccard (or cosine) similarities between items.
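
For readability, a rough Python transcription of those steps (the crc32 hash and the constants are stand-ins for whatever hash functions you'd actually use in BigQuery):

    import random
    import zlib

    NUM_PERM = 64
    A = [random.randrange(1, 2**32) for _ in range(NUM_PERM)]   # step 2
    B = [random.randrange(1, 2**32) for _ in range(NUM_PERM)]

    def ngrams(text, n=3):                                      # step 1
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def minhash(text):
        # Step 3: hash each ngram, mix it with the i-th (A, B) pair, keep the minimum.
        hashes = [zlib.crc32(g.encode("utf8")) for g in ngrams(text)]
        return [min((h * A[i]) % B[i] for h in hashes) for i in range(NUM_PERM)]

    def buckets(sig, band_size=4):
        # Step 4: collapse each run of 4 consecutive minhash values into one bucket id.
        return [sum(sig[i:i + band_size]) for i in range(0, len(sig), band_size)]

    sig1, sig2 = minhash("minhash in bigquery"), minhash("minhash in big query")
    shared = sum(x == y for x, y in zip(buckets(sig1), buckets(sig2)))
    print(shared, "of", len(buckets(sig1)), "bands match")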



love this and have been using tf/idf for embeddings and various measures of similarity for some personal pet projects. one thing i came across in my research is that cosine similarity was more useful for vectors of different lengths and that euclidean distance was useful for vectors of similar length but simon alludes to a same-length requirement. i’m not formally trained in this area so i was hoping someone could shed some light on this for me.



You can use cosine similarity with embedding vectors of different lengths (or better, the vectors have all the same length, but they are sparse with most components being 0).



You can use min hash to avoid the full O(N^2) distance matrix, but with just 600000 items you might just brute force compute the full matrix for simplicity. What's your time budget?



If you think two items cover the same highly similar _keywords_ then Jaccard distance should work.

If you think two items share highly similar _text_ then give Levenshtein distance a try.
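
A toy illustration of the difference, with throwaway implementations of both (two texts with the same keywords in a different order score 1.0 on Jaccard but have a large edit distance):

    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b)

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    x = "jaccard similarity on keywords"
    y = "keywords on jaccard similarity"
    print(jaccard(x, y), levenshtein(x, y))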



In case the author is following this:

Typo:

It’s worth noticing that this definition is not in general transitive: we may well have three documents A,B,C such that S(A,B)≥S_{crit} and S(B,C)≥S_{crit} but S(A,B)<S_{crit}

That last S(A,B) should be S(A,C).

(Unless I'm very much mistaken ... and I could be)



As a document clustering / dataset deduplication technique, how well does "throwing ML at the problem" (e.g. using a pre-trained LLM encoder to generate vector embeddings of your documents, and then sticking those vectors into a vector DB and doing k-means to cluster the vectors) compare, quality-wise and performance-wise, to these simpler discrete-algorithm methods?



Using an LLM is just one of the ways to generate embeddings. To do k-means you still need to pick a distance function, like Jaccard; k-means is probably not ideal for near duplicates, and you can use min-hash to speed up k-means as a pre-pass.

I don't think the vector DB adds much. You could use it to speed up the lookup of the min-hash sketches if you have hundreds of millions of documents, but it is probably overkill.



I was picturing doing the deduplication as a map-reduce process over a huge (petabytes) dataset, where every worker is blind to what embeddings other workers have already generated. In such a case, shoving the embeddings generated by each worker into a shared vector DB, and having it (maybe incrementally) clustering the vectors as it receives them, would be acting as the "reduce" step.



I have seen it work better than LSH.

Each time you embed a document you search for approximate nearest neighbours before adding it, so it is O(N) like MinHash. Vector indexes like HNSW and PQ have better performance/quality tradeoffs than SimHash LSH which is the analogue of MinHash for cosine distance.

The quality depends on what you mean by near duplicate and the embedding model you use. Current models work well, and if you have labelled data you can fine tune them to be better.

The main drawback is the additional cost of embedding all the documents, especially for longer documents. But this cost has dropped really quickly with smaller models, better optimisations, and faster hardware.
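
A sketch of that embed-then-query-then-insert loop, assuming you already have one embedding vector per document (as a numpy array) and using hnswlib as a stand-in ANN index; the 0.1 cosine-distance cut-off is an arbitrary placeholder:

    import numpy as np
    import hnswlib

    DIM = 384                                   # depends on your embedding model
    index = hnswlib.Index(space="cosine", dim=DIM)
    index.init_index(max_elements=1_000_000, ef_construction=200, M=16)

    def is_near_duplicate(vec, doc_id, threshold=0.1):
        # Query existing neighbours first, then add the new document, so each
        # document is only compared against what arrived before it (O(N) overall).
        dup = False
        if index.get_current_count() > 0:
            labels, distances = index.knn_query(vec, k=1)
            dup = distances[0][0] < threshold   # cosine distance below the cut-off
        index.add_items(vec.reshape(1, -1), np.array([doc_id]))
        return dup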



If this is true, then why do you think that — as the OP article states — the developers of GPT3 chose to use non-ML-based techniques to deduplicate their dataset, when they would be the most equipped to use ML-based approaches?

Just the pure compute cost of needing to run an ML encoder over petabytes of data?

Or maybe because for their use-case — eliminating redundancy to reduce total dataset size and therefore training time — a non-domain-specific vectorization with a high-false-negative cluster-discovery rate was acceptable, because it just meant they'd "compress" the dataset slightly less well, and so get slightly more training time? (At the expense of increased bias in training toward the saliency of the features that weren't dedup'ed out; but that was going to happen regardless, and they likely already had a fully-general technique later in the pipeline for countering that.)



Do the minhash-based algorithms in the article give exactly the same results (but faster) as pairwise calculation of J(a, b), or are the minhash-based results approximate?



there was a really simple string sort by similarity oneliner I can't find in my history, but the basic idea was to take a wordlist, split each word in half, reverse the halves, concatenate them back together, sort the resulting strings, then flip the strings back to their original words.

this simple shuffle was a poor man's similarity sort that I used to find sound-alike words. the thing about the quality of similarity is that it's hard to know for sure if it worked, but this was sufficient.
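
my reading of that shuffle, as a sketch (the original one-liner may have differed in the details):

    def swap_halves(word):
        mid = len(word) // 2
        return word[mid:] + word[:mid]

    def soundalike_sort(words):
        # sort on the swapped form so words sharing endings/beginnings cluster,
        # then hand back the original words in that order
        return sorted(words, key=swap_halves)

    print(soundalike_sort(["nation", "station", "ration", "notion", "random"]))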



I had to do this with sports teams, but I used Levenshtein for the names. I ended up creating a vector with other data (country, gender) and using that vector to calculate the distance (similarity).
