They're not, though I get why you think that. They're making a vector for a text out of the term frequencies in the document. It's one step simpler than tf-idf, which makes it a great starting point.
> There is no conceptual or practical path from what you describe to what modern embeddings are.

There certainly is. At the very least there is a strong relation between bag-of-words representations and methods like word2vec. I am sure you know all of this, but I think it's worth expanding a bit, since the top-level comment describes things in a rather confusing way.

In traditional Information Retrieval, two kinds of vectors were typically used: document vectors and term vectors. If you make a |D| x |T| matrix (where |D| is the number of documents and |T| is the number of terms that occur across all documents), you can go through a corpus and note in each |T|-length row the frequency of each term in that particular document (frequency here means raw counts or something like TF-IDF). Each row is a document vector, each column a term vector. The cosine similarity between two document vectors tells you whether two documents are similar, because similar documents are likely to contain similar terms. The cosine similarity between two term vectors tells you whether two terms are similar, because similar terms tend to occur in similar documents. The top-level comment seems to have explained document vectors in a clumsy way.

Over time (we are talking 70s-90s), people found that term vectors built this way did not work well, because documents are often too coarse-grained as context. So term vectors were instead built from |T| x |T| matrices: if you have such a matrix C, then C[i][j] contains how often the j-th term occurs in the context of the i-th term. Since this type of matrix is not bound to documents, you can choose the context size based on the goals you have in mind. For instance, you could count only terms that occur within a window of 10 tokens around each occurrence of term i.

One refinement is to use some measure other than raw frequencies. The issue with raw frequencies is that a frequent word like 'the' will co-occur with pretty much every word, so its frequency in a term vector is not particularly informative, yet its large value will have an outsized influence on e.g. dot products. So people typically used pointwise mutual information (PMI) instead. It's beyond the scope of a comment to explain PMI in full, but intuitively the PMI of two words asks: how much more often do the words co-occur than they would by chance? This results in a low score for e.g. PMI(information, the) but a high score for PMI(information, retrieval). It's also common practice to replace negative PMI values with zero, which gives PPMI (positive PMI).

So, what do we have now? A |T| x |T| matrix of PPMI scores, where each row (or column) can be used as a word vector. However, it's a bit unwieldy, because the vectors are large (length |T|) and typically quite sparse. So people started to apply dimensionality reduction, e.g. with Singular Value Decomposition (SVD; I'll skip the details of how to use it for dimensionality reduction). If we use SVD to reduce the dimensionality to 300, we are left with a |T| x 300 matrix, and we finally have dense vectors, similar to e.g. word2vec.

Now, the interesting thing is that people have found that word2vec's skip-gram with negative sampling (SGNS) is implicitly factorizing a PMI-based word-context matrix [1], exactly like the IR folks were doing before. Conversely, if you matrix-multiply the word and context embedding matrices that come out of word2vec SGNS, you get an approximation of the |T| x |T| PMI matrix (or |T| x |C| if a different vocabulary is used for the contexts).

Summarized: there is a strong conceptual relation between the bag-of-words representations of the old days and word2vec. Whether it's an interesting route didactically for understanding embeddings is up for debate. The mathematics behind word2vec is not complex (understanding the dot product and the logistic function goes a long way), and understanding word2vec in terms of 'neural net building blocks' makes it easier to go from word2vec to modern architectures. But in an exhaustive course about word representations, it certainly makes sense to link word embeddings to prior work in IR.

[1] https://proceedings.neurips.cc/paper_files/paper/2014/file/f...
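To make that pipeline concrete, here is a minimal sketch under made-up assumptions (a three-sentence toy corpus, a context window of 2, and only 2 output dimensions instead of 300), just to show the shape of the computation:

```python
# Sketch: co-occurrence counts -> PPMI -> truncated SVD -> dense word vectors.
import numpy as np

corpus = [
    "information retrieval finds relevant documents",
    "neural networks learn word representations",
    "word vectors capture word similarity",
]

# Build the vocabulary and a |T| x |T| co-occurrence matrix C,
# where C[i][j] counts how often term j appears near term i.
tokens = [doc.split() for doc in corpus]
vocab = sorted({w for doc in tokens for w in doc})
idx = {w: i for i, w in enumerate(vocab)}
window = 2

C = np.zeros((len(vocab), len(vocab)))
for doc in tokens:
    for i, w in enumerate(doc):
        for j in range(max(0, i - window), min(len(doc), i + window + 1)):
            if j != i:
                C[idx[w], idx[doc[j]]] += 1

# Convert raw counts to PPMI: max(0, log(P(i, j) / (P(i) * P(j)))).
total = C.sum()
p_ij = C / total
p_i = C.sum(axis=1, keepdims=True) / total
p_j = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_ij / (p_i * p_j))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Dimensionality reduction with SVD; keep the top k singular vectors
# to get one dense vector per vocabulary term (k would be ~300 in practice).
k = 2
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :k] * S[:k]

# Cosine similarity between rows of word_vectors now plays the same role
# as it did with the sparse PPMI rows.
print(vocab[0], word_vectors[0])
```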
Really appreciate you explaining this idea, I want to try this! It wasn't clear to me until I read the discussion that you meant that you'd have similarity of entire documents, not among words.
Yes! And that’s an oversight on my part — word embeddings are interesting but I usually deal with documents when doing nlp work and only deal with word embeddings when thinking about how to combine them into a document embedding. Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop. |
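One possible way to play with that corpus is a minimal sketch of the term-frequency baseline discussed upthread; the vectorizer settings and the choice of query document here are arbitrary:

```python
# Plain term-count document vectors over the 20 newsgroups corpus,
# compared with cosine similarity. No deep learning involved.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

# Each row of X is a document vector of term frequencies.
X = CountVectorizer(stop_words="english", max_features=20000).fit_transform(docs)

# Find the documents most similar to the first one (skipping itself).
sims = cosine_similarity(X[0], X).ravel()
for i in sims.argsort()[::-1][1:6]:
    print(f"{sims[i]:.3f}  {docs[i][:80]!r}")
```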
Without getting into any big debates about whether or not RAG is medium-term interesting or whatever, you can ‘pip install sentence-transformers faiss’ and just immediately start having fun. I recommend using straightforward cosine similarity to crush the NYT’s recommender as a fun project for two reasons: there’s an API and plenty of corpus, and it’s like, whoa, that’s better than the New York Times. James Briggs has great stuff on this: https://youtube.com/@jamesbriggs He’s trying to sell a SaaS product (Pinecone), but he’s doing it the right way: it’s ok to be an influencer if you know what you’re talking about.
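A minimal sketch of that setup, assuming a small MiniLM model and a made-up handful of "articles" standing in for a real corpus (in practice the FAISS package is usually installed as faiss-cpu or faiss-gpu):

```python
# sentence-transformers for embeddings, FAISS for the similarity search.
# With normalized vectors, inner-product search equals cosine similarity.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

articles = [
    "Opinion: The case for a four-day work week",
    "How to make a perfect sourdough loaf at home",
    "New telescope images reveal distant galaxies",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
vecs = model.encode(articles, normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # inner product on unit vectors
index.add(np.asarray(vecs, dtype="float32"))

query = model.encode(["baking bread recipes"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {articles[i]}")
```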
> crush the NYT's recommender as a fun project for two reasons

Could you share what recommender you're referring to here, and how you can evaluate "crushing" it? Sounds fun!
Yeah, these can be cute, but they're not ideal. I think the user feedback mechanism could help naturally align this over time, but it would also be gameable. It's all interesting stuff
I learned how to use embeddings by building semantic search for the Bhagavad Gita. I simply saved the embeddings for all 700 verses into a big file which is stored in a Lambda function, and compared against incoming queries with a single query to OpenAI's embedding endpoint. Shameless plug in case anyone wants to test it out - https://gita.pub |
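A rough sketch of that kind of setup (file names, model choice, and helper structure are all assumptions for illustration, not the actual site's code):

```python
# Precomputed verse embeddings in a .npy file, queries embedded on the fly
# via OpenAI's embeddings endpoint, ranked by cosine similarity.
import json
import numpy as np
from openai import OpenAI

verses = json.load(open("verses.json"))        # ["verse 1 text", ...] (assumed file)
verse_vecs = np.load("verse_embeddings.npy")   # shape (700, dim), precomputed

client = OpenAI()

def search(query: str, k: int = 5):
    resp = client.embeddings.create(model="text-embedding-3-small", input=query)
    q = np.array(resp.data[0].embedding)
    # Cosine similarity of the query against every verse vector.
    sims = verse_vecs @ q / (np.linalg.norm(verse_vecs, axis=1) * np.linalg.norm(q))
    return [(float(sims[i]), verses[i]) for i in np.argsort(sims)[::-1][:k]]

print(search("How should I deal with doubt?"))
```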
Great project and excellent initiative to learn about embeddings.
Two possible avenues to explore more.
Your system backend could be thought of as being composed of two parts:
|Icons -> Embedder -> |PGVector| -> Retriever -> Display Result|

1. In the embedder part, try out different embedding models and/or vector dimensions to explore whether Recall@K and Precision@K for your data set (icons) improve. Models make a surprising amount of difference to the quality of the results. Try the MTEB Leaderboard for ideas on which models to explore.

2. In the information retriever part, you can try a couple of approaches:

   a. After you retrieve from PGVector, see if you can use a reranker like Cohere to get better results: https://cohere.com/blog/rerank

   b. You could try a "fusion ranking" similar to the one you do now, but structured such that 50% of the weight comes from a plain old keyword search over the metadata and 50% from the embedding-based search (see the sketch below).

Finally, something more interesting to noodle on: what if the embeddings were based on the icon images themselves, and the model knew how to search for a textual description in the latent space?
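For 2b, a minimal sketch of what a 50/50 fusion could look like; the score sources and numbers are placeholders, and in practice they would come from your keyword query and from PGVector:

```python
# Combine a keyword score and a vector-similarity score per document,
# with equal weight on each after min-max normalization.
def fuse(keyword_scores: dict, vector_scores: dict, w_kw=0.5, w_vec=0.5):
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    kw, vec = normalize(keyword_scores), normalize(vector_scores)
    docs = kw.keys() | vec.keys()
    fused = {d: w_kw * kw.get(d, 0.0) + w_vec * vec.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Made-up scores for three icons:
print(fuse({"icon_a": 3.2, "icon_b": 1.1}, {"icon_b": 0.91, "icon_c": 0.88}))
```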
There are no extra costs other than what we'd normally charge for Edge Function invocations (you get up to 500K in the free plan and 2M in the Pro plan).
We provide this functionality in Lantern cloud via our Lantern Extras extension: <https://github.com/lanterndata/lantern_extras> You can generate CLIP embeddings locally on the DB server via:
Then you can search for embeddings via:
The approach significantly decreases search latency and results in cleaner code.
As an added bonus, EXPLAIN ANALYZE can now tell you what percentage of time was spent in embedding generation vs. search. The linked library enables embedding generation for a dozen open-source models and proprietary APIs (list here: <https://lantern.dev/docs/develop/generate>), and adding new ones is really easy.
If you're building an iOS app, I've had success storing vectors in Core Data and using a tiny Core ML model that runs on-device for embedding, then doing cosine similarity.
Re storing vectors in BLOB columns: yeah, if it's not a lot of data and it's fast enough for you, then there's no problem doing it like that. I'd even just store them in JSON/.npy files first and see how long you can get away with it. Once that gets too slow, try SQLite/Redis/Valkey, and when that gets too slow, look into pgvector or other vector database solutions.

For SQLite specifically, very large BLOB columns might affect query performance, especially for large embeddings. For example, a 1536-dimension vector from OpenAI would take 1536 * 4 = 6144 bytes of space if stored in a compact BLOB format. That's larger than SQLite's default page size of 4096, so the extra data will spill into overflow pages. Which, again, isn't too big of a deal, but if the original table had small values before, then table scans can get slower.

One solution is to move the embeddings to a separate table, e.g. alongside an original `users` table you can make a new `CREATE TABLE users_embeddings(user_id, embedding)` table and just LEFT JOIN it when you need it (see the sketch below). Or you can use newer techniques like Matryoshka embeddings [0] or scalar/binary quantization [1] to reduce the size of individual vectors, at the cost of lower accuracy. Or you can bump the page size of your SQLite database with `PRAGMA page_size=8192`.

I also have a SQLite extension for vector search [2], but there are a number of usability/ergonomic issues with it. I'm making a new one that I hope to release soon, which will hopefully be a great middle ground between "store vectors in .npy files" and "use pgvector".

Re "do embeddings ever expire": nope! As long as you have access to the same model, the same text input should give the same embedding output. It's not like LLMs, which have temperatures/meta prompts/a million other dials that make outputs non-deterministic; most embedding models are deterministic and should work forever.

[0] https://huggingface.co/blog/matryoshka
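A minimal sketch of the separate-table idea (table and column names are illustrative):

```python
# Embeddings as compact BLOBs in their own SQLite table, joined back to
# `users` only when needed, so the narrow table stays fast to scan.
import sqlite3
import numpy as np

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users(user_id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE users_embeddings(user_id INTEGER PRIMARY KEY, embedding BLOB)")

vec = np.random.rand(1536).astype(np.float32)     # 1536 * 4 = 6144 bytes
db.execute("INSERT INTO users VALUES (1, 'alice')")
db.execute("INSERT INTO users_embeddings VALUES (1, ?)", (vec.tobytes(),))

# Pull the vector only on demand via a LEFT JOIN.
name, blob = db.execute(
    "SELECT u.name, e.embedding FROM users u "
    "LEFT JOIN users_embeddings e ON e.user_id = u.user_id WHERE u.user_id = 1"
).fetchone()
restored = np.frombuffer(blob, dtype=np.float32)  # back to a 1536-dim vector
print(name, restored.shape)
```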
I've been adding embeddings to every project I work on for the purpose of vector similarity searches. I was just trying to order Uber Eats and wondering why they don't have a better search based on embeddings. I've almost finished building a feature on JSON Resume that takes your hosted resume and WhoIsHiring job posts and uses embeddings to return relevant results -> https://registry.jsonresume.org/thomasdavis/jobs
I think he is saying: embeddings are deterministic, so they are more predictable in production. They're still magic, with little explainability or adaptability when they don't work.
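A quick way to see the determinism point for yourself with a local model (the model name here is just an example):

```python
# Encoding the same input twice with the same model yields the same vector.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("the same text input")
b = model.encode("the same text input")
print(np.allclose(a, b))  # True: no temperature or sampling involved
```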
Can someone give a qualitative explanation of what the vector of a word with 2 unrelated meanings would look like compared to the vector of a synonym of each of those meanings?
If you think about it like points on a graph, with the vectors as just 2D points (x, y), then each synonym sits close to the sense it shares, the two unrelated senses sit far apart, and the word with both meanings ends up with a single point somewhere in between.
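A toy illustration with invented 2D vectors (real embeddings have hundreds of dimensions, but the geometry is the same idea):

```python
# "bank" has two unrelated senses; "riverbank" and "lender" each match one.
# A single static vector for "bank" tends to sit between its two senses.
import numpy as np

def cos(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

riverbank = [0.9, 0.1]   # "nature/geography" direction (made up)
lender    = [0.1, 0.9]   # "finance" direction (made up)
bank      = [0.6, 0.6]   # one vector pulled toward both senses

print(cos(bank, riverbank))    # moderately high
print(cos(bank, lender))       # moderately high
print(cos(riverbank, lender))  # low: the two senses themselves are far apart
```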
Ah, pgvector is kind of annoying to start with: you have to set it up and maintain it, and then it starts falling apart when you have more vectors.
Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
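A from-scratch sketch of exactly that basic strategy, with made-up documents and nothing but the standard library:

```python
# One term-frequency vector per document, compared with cosine similarity.
import math
from collections import Counter

def tf_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = tf_vector("the cat sat on the mat")
doc2 = tf_vector("a cat lay on a mat")
doc3 = tf_vector("stock markets fell sharply today")

print(cosine(doc1, doc2))  # higher: the documents share terms
print(cosine(doc1, doc3))  # zero: no overlap at all
```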