They're not, though I get why you think that. They're making a vector for a text out of the term frequencies in the document. It's one step simpler than tf-idf, which makes it a great starting point.
> There is no conceptual or practical path from what you describe to what modern embeddings are.

There certainly is. At the very least there is a strong relation between bag-of-words representations and methods like word2vec. I am sure you know all of this, but I think it's worth expanding a bit, since the top-level comment describes things in a rather confusing way.

In traditional Information Retrieval, two kinds of vectors were typically used: document vectors and term vectors. If you make a |D| x |T| matrix (where |D| is the number of documents and |T| is the number of terms that occur across all documents), you can go through a corpus and note in each |T|-length row the frequency of each term in that particular document (frequency here means raw counts or something like TF-IDF). Each row is a document vector, each column a term vector. The cosine similarity between two document vectors tells you whether two documents are similar, because similar documents are likely to contain similar terms. The cosine similarity between two term vectors tells you whether two terms are similar, because similar terms tend to occur in similar documents. The top-level comment seems to have explained document vectors in a clumsy way.

Over time (we are talking 70s-90s), people found that term vectors built this way did not work well, because documents are often too coarse-grained as context. So term vectors were instead built from |T| x |T| matrices: if you have such a matrix C, then C[i][j] contains how often the j-th term occurs in the context of the i-th term. Since this type of matrix is not bound to documents, you can choose the context size based on the goals you have in mind. For instance, you could count only terms that occur within a window of 10 tokens around each occurrence of term i.

One refinement is to use some measure other than raw frequencies. The issue with raw frequencies is that a frequent word like 'the' will co-occur with pretty much every word, so its frequency in a term vector is not particularly informative, yet its large value will have an outsized influence on e.g. dot products. So people typically used pointwise mutual information (PMI) instead. It's beyond the scope of a comment to explain PMI in full, but intuitively the PMI of two words asks: how much more often do the words co-occur than they would by chance? This results in a low score for e.g. PMI(information, the) but a high score for PMI(information, retrieval). It's also common practice to replace negative PMI values with zero, which gives PPMI (positive PMI).

So, what do we have now? A |T| x |T| matrix of PPMI scores, where each row (or column) can be used as a word vector. However, it's a bit unwieldy, because the vectors are large (length |T|) and typically quite sparse. So people started to apply dimensionality reduction, e.g. with Singular Value Decomposition (SVD; I'll skip the details of how to use it for dimensionality reduction). If we use SVD to reduce the dimensionality to 300, we are left with a |T| x 300 matrix, and we finally have dense vectors, similar to e.g. word2vec.

Now, the interesting thing is that people have found that word2vec's skip-gram with negative sampling (SGNS) is implicitly factorizing a PMI-based word-context matrix [1], exactly like the IR folks were doing before. Conversely, if you matrix-multiply the word and context embedding matrices that come out of word2vec SGNS, you get an approximation of the |T| x |T| PMI matrix (or |T| x |C| if a different vocabulary is used for the contexts).

Summarized: there is a strong conceptual relation between the bag-of-words representations of the old days and word2vec. Whether it's an interesting route didactically for understanding embeddings is up for debate. The mathematics behind word2vec is not complex (understanding the dot product and the logistic function goes a long way), and understanding word2vec in terms of 'neural net building blocks' makes it easier to go from word2vec to modern architectures. But in an exhaustive course about word representations, it certainly makes sense to link word embeddings to prior work in IR.

[1] https://proceedings.neurips.cc/paper_files/paper/2014/file/f...
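To make that pipeline concrete, here is a minimal sketch under made-up assumptions (a three-sentence toy corpus, a context window of 2, and only 2 output dimensions instead of 300), just to show the shape of the computation:

```python
# Sketch: co-occurrence counts -> PPMI -> truncated SVD -> dense word vectors.
import numpy as np

corpus = [
    "information retrieval finds relevant documents",
    "neural networks learn word representations",
    "word vectors capture word similarity",
]

# Build the vocabulary and a |T| x |T| co-occurrence matrix C,
# where C[i][j] counts how often term j appears near term i.
tokens = [doc.split() for doc in corpus]
vocab = sorted({w for doc in tokens for w in doc})
idx = {w: i for i, w in enumerate(vocab)}
window = 2

C = np.zeros((len(vocab), len(vocab)))
for doc in tokens:
    for i, w in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                C[idx[w], idx[doc[j]]] += 1

# Convert raw counts to PPMI: max(0, log(P(i, j) / (P(i) * P(j)))).
total = C.sum()
p_ij = C / total
p_i = C.sum(axis=1, keepdims=True) / total
p_j = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_ij / (p_i * p_j))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Dimensionality reduction with SVD; keep the top k singular vectors
# to get one dense vector per vocabulary term (k would be ~300 in practice).
k = 2
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :k] * S[:k]

# Cosine similarity between rows of word_vectors now plays the same role
# as it did with the sparse PPMI rows.
print(vocab[0], word_vectors[0])
```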
Really appreciate you explaining this idea, I want to try this! It wasn't clear to me until I read the discussion that you meant that you'd have similarity of entire documents, not among words.
Yes! And that’s an oversight on my part — word embeddings are interesting but I usually deal with documents when doing nlp work and only deal with word embeddings when thinking about how to combine them into a document embedding. Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop. |
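One possible way to play with that corpus is a minimal sketch of the term-frequency baseline discussed upthread; the vectorizer settings and the choice of query document here are arbitrary:

```python
# Plain term-count document vectors over the 20 newsgroups corpus,
# compared with cosine similarity. No deep learning involved.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

# Each row of X is a document vector of term frequencies.
X = CountVectorizer(stop_words="english", max_features=20000).fit_transform(docs)

# Find the documents most similar to the first one (skipping itself).
sims = cosine_similarity(X[0], X).ravel()
for i in sims.argsort()[::-1][1:6]:
    print(f"{sims[i]:.3f}  {docs[i][:80]!r}")
```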
Without getting into any big debates about whether or not RAG is medium-term interesting or whatever, you can ‘pip install sentence-transformers faiss’ and just immediately start having fun. I recommend using straightforward cosine similarity to crush the NYT’s recommender as a fun project for two reasons: there’s an API and plenty of corpus, and it’s like, whoa, that’s better than the New York Times. James Briggs has great stuff on this: https://youtube.com/@jamesbriggs He’s trying to sell a SaaS product (Pinecone), but he’s doing it the right way: it’s ok to be an influencer if you know what you’re talking about.
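A minimal sketch of that setup, assuming a small MiniLM model and a made-up handful of "articles" standing in for a real corpus (in practice the FAISS package is usually installed as faiss-cpu or faiss-gpu):

```python
# sentence-transformers for embeddings, FAISS for the similarity search.
# With normalized vectors, inner-product search equals cosine similarity.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

articles = [
    "Opinion: The case for a four-day work week",
    "How to make a perfect sourdough loaf at home",
    "New telescope images reveal distant galaxies",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
vecs = model.encode(articles, normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product on unit vectors
index.add(np.asarray(vecs, dtype="float32"))

query = model.encode(["baking bread recipes"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {articles[i]}")
```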
> crush the NYT's recommender as a fun project for two reasons

Could you share what recommender you're referring to here, and how you can evaluate "crushing" it? Sounds fun!
Yeah, these can be cute, but they're not ideal. I think the user feedback mechanism could help naturally align this over time, but it would also be gameable. It's all interesting stuff
I learned how to use embeddings by building semantic search for the Bhagavad Gita. I simply saved the embeddings for all 700 verses into a big file which is stored in a Lambda function, and compared against incoming queries with a single query to OpenAI's embedding endpoint. Shameless plug in case anyone wants to test it out - https://gita.pub |
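A rough sketch of that kind of setup (file names, model choice, and helper structure are all assumptions for illustration, not the actual site's code):

```python
# Precomputed verse embeddings in a .npy file, queries embedded on the fly
# via OpenAI's embeddings endpoint, ranked by cosine similarity.
import json
import numpy as np
from openai import OpenAI

verses = json.load(open("verses.json"))        # ["verse 1 text", ...] (assumed file)
verse_vecs = np.load("verse_embeddings.npy")   # shape (700, dim), precomputed

client = OpenAI()

def search(query: str, k: int = 5):
    resp = client.embeddings.create(model="text-embedding-3-small", input=query)
    q = np.array(resp.data[0].embedding)
    # Cosine similarity of the query against every verse vector.
    sims = verse_vecs @ q / (np.linalg.norm(verse_vecs, axis=1) * np.linalg.norm(q))
    return [(float(sims[i]), verses[i]) for i in np.argsort(sims)[::-1][:k]]

print(search("How should I deal with doubt?"))
```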
Great project and excellent initiative to learn about embeddings.
Two possible avenues to explore more.
Your system backend could be thought of as being composed of two parts:
|Icons -> Embedder -> |PGVector| -> Retriever -> Display Result|

1. In the embedder part, try out different embedding models and/or vector dimensions to explore whether Recall@K and Precision@K for your data set (icons) improve. Models make a surprising amount of difference to the quality of the results. Try the MTEB Leaderboard for ideas on which models to explore.

2. In the information retriever part, you can try a couple of approaches:

   a. After you retrieve from PGVector, see if you can use a reranker like Cohere to get better results: https://cohere.com/blog/rerank

   b. You could try a "fusion ranking" similar to the one you do now, but structured such that 50% of the weight comes from a plain old keyword search over the metadata and 50% from the embedding-based search (see the sketch below).

Finally, something more interesting to noodle on: what if the embeddings were based on the icon images themselves, and the model knew how to search for a textual description in the latent space?
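For 2b, a minimal sketch of what a 50/50 fusion could look like; the score sources and numbers are placeholders, and in practice they would come from your keyword query and from PGVector:

```python
# Combine a keyword score and a vector-similarity score per document,
# with equal weight on each after min-max normalization.
def fuse(keyword_scores: dict, vector_scores: dict, w_kw=0.5, w_vec=0.5):
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    kw, vec = normalize(keyword_scores), normalize(vector_scores)
    docs = kw.keys() | vec.keys()
    fused = {d: w_kw * kw.get(d, 0.0) + w_vec * vec.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Made-up scores for three icons:
print(fuse({"icon_a": 3.2, "icon_b": 1.1}, {"icon_b": 0.91, "icon_c": 0.88}))
```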
There are no extra costs other than what we'd normally charge for Edge Function invocations (you get up to 500K in the free plan and 2M in the Pro plan).
We provide this functionality in Lantern cloud via our Lantern Extras extension: <https://github.com/lanterndata/lantern_extras> You can generate CLIP embeddings locally on the DB server via:
Then you can search for embeddings via:
The approach significantly decreases search latency and results in cleaner code.
As an added bonus, EXPLAIN ANALYZE can now tell you what percentage of time was spent in embedding generation vs. search. The linked library enables embedding generation for a dozen open-source models and proprietary APIs (list here: <https://lantern.dev/docs/develop/generate>), and adding new ones is really easy.
If you're building an iOS app, I've had success storing vectors in Core Data and using a tiny Core ML model that runs on-device for embedding, then doing cosine similarity.
Re storing vectors in BLOB columns: yeah, if it's not a lot of data and it's fast enough for you, then there's no problem doing it like that. I'd even just store them in JSON/.npy files first and see how long you can get away with it. Once that gets too slow, try SQLite/Redis/Valkey, and when that gets too slow, look into pgvector or other vector database solutions.

For SQLite specifically, very large BLOB columns might affect query performance, especially for large embeddings. For example, a 1536-dimension vector from OpenAI would take 1536 * 4 = 6144 bytes of space if stored in a compact BLOB format. That's larger than SQLite's default page size of 4096, so the extra data will spill into overflow pages. Which, again, isn't too big of a deal, but if the original table had small values before, then table scans can get slower.

One solution is to move the embeddings to a separate table, e.g. alongside an original `users` table you can make a new `CREATE TABLE users_embeddings(user_id, embedding)` table and just LEFT JOIN it when you need it (see the sketch below). Or you can use newer techniques like Matryoshka embeddings [0] or scalar/binary quantization [1] to reduce the size of individual vectors, at the cost of lower accuracy. Or you can bump the page size of your SQLite database with `PRAGMA page_size=8192`.

I also have a SQLite extension for vector search [2], but there are a number of usability/ergonomic issues with it. I'm making a new one that I hope to release soon, which will hopefully be a great middle ground between "store vectors in .npy files" and "use pgvector".

Re "do embeddings ever expire": nope! As long as you have access to the same model, the same text input should give the same embedding output. It's not like LLMs, which have temperatures/meta prompts/a million other dials that make outputs non-deterministic; most embedding models are deterministic and should work forever.

[0] https://huggingface.co/blog/matryoshka
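A minimal sketch of the separate-table idea (table and column names are illustrative):

```python
# Embeddings as compact BLOBs in their own SQLite table, joined back to
# `users` only when needed, so the narrow table stays fast to scan.
import sqlite3
import numpy as np

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users(user_id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE users_embeddings(user_id INTEGER PRIMARY KEY, embedding BLOB)")

vec = np.random.rand(1536).astype(np.float32)     # 1536 * 4 = 6144 bytes
db.execute("INSERT INTO users VALUES (1, 'alice')")
db.execute("INSERT INTO users_embeddings VALUES (1, ?)", (vec.tobytes(),))

# Pull the vector only on demand via a LEFT JOIN.
name, blob = db.execute(
    "SELECT u.name, e.embedding FROM users u "
    "LEFT JOIN users_embeddings e ON e.user_id = u.user_id WHERE u.user_id = 1"
).fetchone()
restored = np.frombuffer(blob, dtype=np.float32)  # back to a 1536-dim vector
print(name, restored.shape)
```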
I've been adding embeddings to every project I work on for the purpose of vector similarity searches. I was just trying to order Uber Eats and wondering why they don't have a better search based on embeddings. I've almost finished building a feature on JSON Resume that takes your hosted resume and WhoIsHiring job posts and uses embeddings to return relevant results -> https://registry.jsonresume.org/thomasdavis/jobs
I think he is saying: embeddings are deterministic, so they are more predictable in production. They're still magic, with little explainability or adaptability when they don't work.
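A quick way to see the determinism point for yourself with a local model (the model name here is just an example):

```python
# Encoding the same input twice with the same model yields the same vector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("the same text input")
b = model.encode("the same text input")
print(np.allclose(a, b))  # True: no temperature or sampling involved
```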
Can someone give a qualitative explanation of what the vector of a word with 2 unrelated meanings would look like compared to the vector of a synonym of each of those meanings?
If you think about it like points on a graph, with the vectors as just 2D points (x, y), then each synonym sits close to the sense it shares, the two unrelated senses sit far apart, and the word with both meanings ends up with a single point somewhere in between.
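A toy illustration with invented 2D vectors (real embeddings have hundreds of dimensions, but the geometry is the same idea):

```python
# "bank" has two unrelated senses; "riverbank" and "lender" each match one.
# A single static vector for "bank" tends to sit between its two senses.
import numpy as np

def cos(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

riverbank = [0.9, 0.1]   # "nature/geography" direction (made up)
lender    = [0.1, 0.9]   # "finance" direction (made up)
bank      = [0.6, 0.6]   # one vector pulled toward both senses

print(cos(bank, riverbank))    # moderately high
print(cos(bank, lender))       # moderately high
print(cos(riverbank, lender))  # low: the two senses themselves are far apart
```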
Ah, pgvector is kind of annoying to start with: you have to set it up and maintain it, and then it starts falling apart when you have more vectors.
Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
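A from-scratch sketch of exactly that basic strategy, with made-up documents and nothing but the standard library:

```python
# One term-frequency vector per document, compared with cosine similarity.
import math
from collections import Counter

def tf_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = tf_vector("the cat sat on the mat")
doc2 = tf_vector("a cat lay on a mat")
doc3 = tf_vector("stock markets fell sharply today")

print(cosine(doc1, doc2))  # higher: the documents share terms
print(cosine(doc1, doc3))  # zero: no overlap at all
```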