A polynomial autoencoder beats PCA on transformer embeddings

Original link: https://ivanpleshkov.dev/blog/polynomial-autoencoder/


The most direct way to compress an embedding (other than quantization) is to fit PCA on the corpus and keep the top-d eigenvectors. It works, but PCA is a linear projection, and neural-network embeddings on the sphere are structurally nonlinear — the well-known cone effect in transformers. Some of the variance lives in a nonlinear tail that a linear decoder can’t reach.

This post is about a closed-form way to add a quadratic decoder on top of PCA, to capture part of that nonlinear tail. The encoder stays as plain PCA. The decoder is a degree-2 polynomial lift plus Ridge OLS (ordinary linear regression with L2 regularization), also closed-form. No SGD, no epochs, no hyperparameter search. One np.linalg.solve over corpus statistics.

The construction itself isn’t mine. “PCA encoder + quadratic decoder + least-squares fit” appears in the dynamical-systems literature under the name quadratic manifold (see Jain 2017, Geelen-Willcox 2022/2023, Schwerdtner-Peherstorfer 2024+ — more in §9). I only stumbled onto these papers after running my experiments and believing the construction was new. The polynomial lift doesn’t come up much in modern ML conversations, and this post is a note about a useful trick from an adjacent discipline carrying over to retrieval.

The concrete result. BEIR/FiQA, mxbai-embed-large-v1 (1024d), per-vector budget 512 bytes. Metric is NDCG@10 (Normalized Discounted Cumulative Gain over the top-10, the standard retrieval ranking-quality measure; range [0, 1], higher is better):

| method | NDCG@10 | Δ vs raw 1024d |
|---|---|---|
| raw 1024d (4096 bytes) | 0.4525 | |
| PCA top-256 | 0.4168 | -3.58 p.p. |
| poly-AE 256d | 0.4441 | -0.85 p.p. |
| matryoshka top-256 | 0.4039 | -4.86 p.p. |

PCA already gives 4× per-vector memory compression at -3.58 p.p. NDCG. A quadratic decoder on top of PCA pulls another +2.73 p.p. — closing almost the entire gap to raw, at the same byte budget. Matryoshka is in the table as another familiar baseline (here it drops more than PCA — a known side observation, not the central claim of this post; see §3 footnote and §4).

Measured on four models (nomic-v1.5, mxbai-large, bge-base, e5-base). Poly-AE over PCA: +1 to +4.4 p.p. at d=128, +0.03 to +2.7 p.p. at d=256. Full table in §3.

Full implementation — ~150 lines of numpy, MIT, repository github.com/IvanPleshkov/poly-autoencoder. The BEIR eval script is beir_eval.py in the same repo. Reproduces in 30–40 minutes on an M-series MacBook (10–15 min to encode the corpus, ~15 min for the Ridge solve at d=256).

Contents:

1. What we're comparing
2. Experimental setup
3. The headline table — four models
4. Where the quadratic decoder actually helps
5. Why a linear projection loses information
6. A polynomial decoder via a linear lift
7. The small-corpus measurement and in-sample magic
8. Compression — what to do with the residual
9. Where the method came from and where it's already used
10. Limits of the method
11. What to try next

1. What we’re comparing

Four lines in every measurement:

  • raw — the full embedding, no compression. Quality ceiling and the most bytes per vector. The “expensive” baseline that gives a fair ceiling.
  • matryoshka — embedding[:d] plus L2 normalization. On models trained with a matryoshka loss (nomic, mxbai in our sample), this is a valid matryoshka vector. On models without matryoshka training (bge-base, e5-base) it’s a test of what happens when you naively slice a non-MRL model — the scenario users of the bge family, e5, and custom-fine-tuned embeddings actually fall into.
  • PCA — top-d eigenvectors of the corpus covariance. Vectors live in d-dimensional PCA coordinates.
  • poly-AE — our method. Encode with PCA into p ∈ ℝ^d, decode with a quadratic polynomial back to a full D-dimensional V̂, retrieve on V̂.

At a fixed d, all four methods store 2d bytes per vector (fp16 coordinates).
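To make the byte accounting concrete, here is a minimal sketch of the matryoshka baseline (the helper name is mine, not from the repo):

```python
import numpy as np

def matryoshka_slice(V, d):
    """Naive truncation: keep the first d coordinates, re-normalize.
    Stored as d fp16 coordinates, i.e. 2d bytes per vector."""
    X = V[:, :d].astype(np.float16)          # what actually goes to disk
    X = X.astype(np.float32)                 # decode for similarity search
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```

On an MRL-trained model this slice is a valid embedding; on bge/e5 it is exactly the naive-slicing scenario measured in §3.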

2. Experimental setup

BEIR is the standard set of retrieval datasets (SciFact, FiQA, NFCorpus, TREC-COVID, etc.). The metric is NDCG@10. Corpus and queries are encoded with the chosen model, top-10 are retrieved by cosine similarity, and NDCG is computed against the labeled qrels.
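For reference, NDCG@10 itself is a few lines of numpy. This sketch uses the graded-gain form 2^rel − 1; the repo delegates to pytrec_eval, which may differ in details:

```python
import numpy as np

def ndcg_at_10(scores, qrels):
    """scores: (Q, N) similarity matrix; qrels: {query_idx: {doc_idx: relevance}}."""
    discounts = 1.0 / np.log2(np.arange(2, 12))        # ranks 1..10
    total = 0.0
    for q, rels in qrels.items():
        top = np.argsort(-scores[q])[:10]
        gains = np.array([2.0 ** rels.get(int(d), 0) - 1 for d in top])
        dcg = float(gains @ discounts[:len(top)])
        ideal = sorted((2.0 ** r - 1 for r in rels.values()), reverse=True)[:10]
        idcg = float(np.array(ideal) @ discounts[:len(ideal)])
        total += dcg / idcg if idcg > 0 else 0.0
    return total / len(qrels)
```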

PCA and poly-AE are fit transductively: statistics are computed on the corpus we want to compress. Queries never participate in the fit — they hit a fixed encoder/decoder at inference time. This matches a production deployment: an index operator computes PCA + Ridge once on their data and then serves queries.
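The transductive split in code: statistics come from the corpus, queries only ever pass through the frozen maps (a sketch; function names are mine):

```python
import numpy as np

def fit_pca(corpus, d):
    """Operator side: run once on the corpus that will be compressed."""
    mean = corpus.mean(axis=0)
    # right singular vectors of the centered corpus = covariance eigenvectors
    _, _, Vt = np.linalg.svd(corpus - mean, full_matrices=False)
    return mean, Vt[:d].T                    # (D,), (D, d)

def encode(x, mean, Q):
    """Query side: queries hit the frozen encoder, never the fit."""
    return (x - mean) @ Q
```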

For the main runs we use FiQA — 57K documents, 648 queries, 1706 qrels.

3. The headline table — four models

NDCG@10 on FiQA at budgets of 256 fp16 (512 bytes/vector) and 128 fp16 (256 bytes):

| Model | D | d | raw | matryoshka† | PCA | poly-AE | poly over matryoshka | poly vs raw |
|---|---|---|---|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | 128 | 0.3746 | 0.3190 | 0.3273 | 0.3380 | +1.90 p.p. | -3.65 p.p. |
| nomic-embed-text-v1.5 | 768 | 256 | 0.3746 | 0.3508 | 0.3670 | 0.3673 | +1.65 p.p. | -0.73 p.p. |
| mxbai-embed-large-v1 | 1024 | 128 | 0.4525 | 0.3503 | 0.3689 | 0.4129 | +6.26 p.p. | -3.97 p.p. |
| mxbai-embed-large-v1 | 1024 | 256 | 0.4525 | 0.4039 | 0.4168 | 0.4441 | +4.02 p.p. | -0.85 p.p. |
| bge-base-en-v1.5* | 768 | 128 | 0.4062 | 0.2914 | 0.3266 | 0.3654 | +7.40 p.p. | -4.09 p.p. |
| bge-base-en-v1.5* | 768 | 256 | 0.4062 | 0.3574 | 0.3688 | 0.3958 | +3.84 p.p. | -1.05 p.p. |
| e5-base-v2* | 768 | 128 | 0.3987 | 0.2498 | 0.3065 | 0.3317 | +8.18 p.p. | -6.70 p.p. |
| e5-base-v2* | 768 | 256 | 0.3987 | 0.3333 | 0.3618 | 0.3852 | +5.19 p.p. | -1.35 p.p. |

† The “matryoshka” column is embedding[:d] plus L2 normalization. On nomic and mxbai it’s a valid matryoshka vector. On models marked with * (bge-base, e5-base) the model wasn’t trained for matryoshka, and the slice here is a test of what happens when you naively slice a non-MRL model. This is the scenario that users of the bge family, e5, and custom-fine-tuned embeddings actually fall into — and we measure it here honestly.

What this shows:

  1. Poly-AE is consistently ahead of PCA across all four models. Lift: +1 to +4.4 p.p. NDCG at d=128, +0.03 to +2.7 p.p. at d=256. Where the quadratic decoder actually helps and at what d — discussed in §4.
  2. At d=256, poly-AE loses 0.7–1.4 p.p. NDCG vs raw 768/1024 on all four models. 4× per-vector memory compression for less than 1.5 p.p. lost — the main number of the post.
  3. On non-matryoshka-trained models, the matryoshka column drops more than PCA — up to -15 p.p. NDCG at d=128. This is a side observation: the post compares PCA and poly-AE, not PCA vs matryoshka. If the matryoshka numbers in the table look surprising, there’s a short pointer in §4.

4. Where the quadratic decoder actually helps

PCA is the linear baseline. The quadratic decoder adds the nonlinear piece that the linear one can’t reach (mechanics in §5). How much does that actually help on retrieval, and at what d?

Poly-AE lift over PCA, by model and d:

| Model | poly over PCA, d=128 | poly over PCA, d=256 |
|---|---|---|
| nomic-v1.5 | +1.07 p.p. | +0.03 p.p. |
| mxbai-large | +4.40 p.p. | +2.73 p.p. |
| bge-base | +3.88 p.p. | +2.70 p.p. |
| e5-base-v2 | +2.52 p.p. | +2.34 p.p. |

The picture:

  1. At d=128 (8× compression) poly is consistently 1–4 p.p. ahead of PCA. This is the regime where the linear decoder starts dropping noticeable variance into the nonlinear tail, and the quadratic correction pulls it back. Sweet spot for the method.

  2. At d=256 (4× compression) the gap is uneven. On mxbai/bge/e5 — a stable +2.3–2.7 p.p. On nomic — close to zero (+0.03). Likely reason: nomic was carefully trained with multi-slice contrastive loss, its latent is more isotropic, and at d=256 the linear projection already takes most of what’s there. On non-MRL models the nonlinear tail is bigger → poly helps more.

  3. More anisotropy → bigger lift. The stronger the cone effect, the more variance lives in the nonlinear tail that PCA can’t reach but poly can. That’s the geometry §5 unpacks.

Side: where matryoshka sits in the table

In §3 you can see that on non-MRL models (bge, e5) the matryoshka column drops more than PCA — i.e. on a random non-MRL-trained model, naive slicing works worse than a corpus-side linear projection. This is a known result; the “MRL vs PCA on retrieval” question has been discussed independently of this post — see Matryoshka-Adaptor 2024, SMEC “Rethinking MRL” 2025, CoRECT 2025, and the YouTube video literally titled «Is PCA enough?». This post compares PCA and poly-AE; matryoshka is in the §3 table as a third reference point.

Where poly-AE doesn’t apply

A corpus-side PCA fit is a required part of the method. That means poly-AE doesn’t work when the corpus isn’t available:

  • multi-tenant SaaS: one model, thousands of clients with different corpora — fitting PCA per client is operational pain;
  • streaming indices: statistics drift over time, PCA needs periodic refits;
  • edge inference: phone, browser, embedded — you don’t want to ship a per-client PCA matrix alongside the model.

In those settings you want an MRL-trained model and embedding[:d], and poly-AE isn’t an alternative — it also needs corpus statistics.

Practical takeaway

In the operator-fit setting (fixed corpus, the operator fits compression once), you have two working modes:

  • d=256 gives 4× compression at -0.7 to -1.4 p.p. NDCG vs raw. Poly over PCA: from +0.03 (nomic) to +2.7 p.p. (mxbai/bge/e5). On an MRL-trained model like nomic the gap to PCA is minimal; on non-MRL models poly is clearly ahead.
  • d=128 gives 8× compression. Poly over PCA: +1 to +4.4 p.p. on any model. Sweet spot for the method.

5. Why a linear projection loses information

PCA is the best possible linear projection of data into a d-dimensional subspace. But “best linear” doesn’t mean “good enough”: if the data has nonlinear structure, a linear decoder can’t reach it, period.

Transformer embeddings have such structure, well-studied — the cone effect. The point cloud is concentrated inside a narrow cone on the unit sphere and is heavily nonlinearly structured inside that cone.

[Figure: isotropic vs anisotropic point clouds (interactive 3D viewer in the original). Left: isotropic data. Right: one example of what anisotropic data can look like.]

PCA only catches projections along orthogonal eigenvectors — i.e. the linear ellipsoid that the cloud doesn’t actually fit in. Whatever lives in the curvature of the manifold is structurally invisible to a linear decoder. The stronger the anisotropy (the narrower the cone), the more variance sits in this nonlinear tail that PCA structurally can’t reach.

So what we want is clear: a decoder that can work with quadratic combinations of coordinates, i.e. one that captures the local curvature of the manifold. Then we'd recover some of the information that PCA loses.
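A toy version of this picture: points on a parabola in 2D, compressed to one coordinate. PCA's linear decoder is blind to the curvature; a quadratic decoder on the same 1-d code recovers it almost exactly (a synthetic illustration, not one of the post's experiments):

```python
import numpy as np

t = np.linspace(-1, 1, 500)
V = np.stack([t, t ** 2], axis=1)            # points on a parabola

# linear: PCA down to 1 coordinate and back
mean = V.mean(axis=0)
_, Q = np.linalg.eigh(np.cov((V - mean).T))
q = Q[:, -1:]                                # top eigenvector (the t axis)
p = (V - mean) @ q                           # 1-d code
err_lin = np.mean(np.sum((V - (p @ q.T + mean)) ** 2, axis=1))

# quadratic decoder on the same code: features [1, p, p^2]
L = np.concatenate([np.ones_like(p), p, p ** 2], axis=1)
W = np.linalg.lstsq(L, V, rcond=None)[0]
err_quad = np.mean(np.sum((V - L @ W) ** 2, axis=1))
# err_quad is near machine zero: the coordinate PCA drops is exactly quadratic in p
```

The linear reconstruction error is essentially the variance of the t² coordinate; the quadratic decoder drives it to numerical noise.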

6. A polynomial decoder via a linear lift

§5 made it clear that we need a nonlinear decoder. The straightforward route is to train a neural network. But then we lose the closed-form pipeline: SGD, learning rate, batch size, convergence, early stopping. We’d like a nonlinear decoder solvable by a single formula.

Standard regression trick: the polynomial lift. Take vector p and lift it into all monomials up to degree 2 — bias, linear terms, squares, and pairwise products. On a 2D example:

lift([p₁, p₂]) = [1,   p₁, p₂,   p₁², p₁·p₂, p₂²]
                  ↑    ↑   ↑     ↑    ↑      ↑
                bias  linear     quadratic

Any linear combination of these six = a quadratic function of the original p₁, p₂. So a linear regression on the lift = a quadratic regression in the original space. No nonlinear optimization needed, plain Ridge OLS in closed form does it.

Apply this to our case. The encoder is top-d PCA (p = (V − V̄) @ Q, p ∈ ℝ^d). Lift p into all monomials up to degree 2:

def polynomial_lift(p, degree=2):
    features = [1.0]                              # bias
    features.extend(p)                            # degree 1
    for i in range(len(p)):
        for j in range(i, len(p)):
            features.append(p[i] * p[j])          # degree 2
    return np.array(features)

The lift dimension is M = (d+1)(d+2)/2. For d=128 that’s M = 8385, for d=256 it’s M = 33153. The decoder is a linear regression from the M-dimensional lift back to the D-dimensional original. Despite being “just a linear regression”, the resulting map p → V̂ is quadratic — exactly what we needed.
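The count is easy to verify: 1 bias term, d linear terms, and d(d+1)/2 upper-triangular quadratic terms:

```python
def lift_dim(d):
    # bias + linear + upper-triangular quadratic terms = (d+1)(d+2)/2
    return 1 + d + d * (d + 1) // 2
```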

Training the decoder is one np.linalg.solve:

def fit_decoder(P, V, lam=1e-3):
    L = np.stack([polynomial_lift(p) for p in P])  # (N, M) design matrix
    G = L.T @ L
    G += lam * np.trace(G) / G.shape[0] * np.eye(G.shape[0])  # trace-scaled Ridge
    W = np.linalg.solve(G, L.T @ V)                # (M, D) decoder weights
    return W

The polynomial autoencoder is assembled:

Polynomial autoencoder schema: V → PCA → p → poly lift → L → Ridge OLS → V̂; V − V̂ → V_resid
  • encoder — closed-form linear (PCA), no training,
  • decoder — closed-form quadratic (Ridge OLS on the polynomial lift), trained with one np.linalg.solve,
  • reconstruction — V̂ = polynomial_lift(p) @ W,
  • residual — V_resid = V − V̂ (useful in §8 for quantization).

No backprop, no learning rate, no batch size, no epochs. On 100K vectors at d=100, a couple of minutes on CPU. At d=256, around 15 minutes (Ridge solves an M × M system, cubic in M).
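Assembled end to end, the whole fit is a few dozen lines. A sketch with a vectorized lift (function names are mine; the repo's API may differ):

```python
import numpy as np

def batch_lift(P):
    """Degree-2 lift of a whole batch: (N, d) -> (N, (d+1)(d+2)/2)."""
    N, d = P.shape
    iu, ju = np.triu_indices(d)
    quad = P[:, iu] * P[:, ju]               # squares and pairwise products
    return np.hstack([np.ones((N, 1)), P, quad])

def poly_ae_fit(V, d, lam=1e-3):
    """PCA encoder + quadratic Ridge decoder, both closed-form."""
    mean = V.mean(axis=0)
    _, Q = np.linalg.eigh(np.cov((V - mean).T))
    Q = Q[:, ::-1][:, :d]                    # top-d eigenvectors
    P = (V - mean) @ Q                       # latent codes (N, d)
    L = batch_lift(P)                        # design matrix (N, M)
    G = L.T @ L
    G += lam * np.trace(G) / G.shape[0] * np.eye(G.shape[0])
    W = np.linalg.solve(G, L.T @ V)          # decoder weights (M, D)
    return mean, Q, W

def poly_ae_decode(P, W):
    return batch_lift(P) @ W                 # V_hat, back in R^D
```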

One technical note on normalization

Raw p after PCA has huge anisotropy across its coordinates (the first axis carries ~50× the variance of the last), which is bad for Ridge: large axes dominate the solution and small ones get under-weighted. The fix is per-axis std normalization:

p_normed = p / np.sqrt(eigvals)          # whiten: divide each axis by its PCA std
p = p_normed * (0.9 / np.linalg.norm(p_normed, axis=1).max())  # global rescale

The global 0.9 / max-norm factor gives ‖p‖ ≤ 0.9 — numerical stability under squaring. Doesn’t affect NDCG, but cleaner numerically.

7. The small-corpus measurement and in-sample magic

A methodological note. On the first sanity check on SciFact (5183 docs, 300 queries), poly-AE at d=256 showed NDCG@10 0.6980 vs raw 0.7032 — losing only 0.5 p.p. The reported gain over PCA there was +1.36 p.p.

On FiQA (57638 docs), the same experiment gave:

  • poly vs raw: -0.73 p.p. (stable, like SciFact);
  • poly over PCA: +0.03 p.p. (down from +1.36 on SciFact).

What happened. On SciFact at d=256 Ridge solves a regression with M=33153 features on N=5183 vectors — severely underdetermined: the in-sample V̂_corpus matches V_corpus almost exactly, so retrieval on V̂_corpus is equivalent to retrieval on the full 768-d V_corpus. The SciFact poly-AE measurement inflated its gap with PCA because it memorized the corpus in the Ridge weights, not because it generalized better.

What’s important: the poly-vs-raw drop is stable (-0.5 to -0.85 p.p.) across both corpora. But for poly-vs-PCA, on small corpora (N ≤ 10K at d=256) the gap is partly mythical. A safe rule of thumb: N should be at least 5–10× larger than M = (d+1)(d+2)/2. For d=256 that’s N ≳ 200K. For d=128 it’s N ≳ 50K.

In the headline table (§3), FiQA’s N = 57K is enough for d=128 and borderline for d=256. The bge numbers on FiQA at d=256 should be read with that caveat.

8. Compression — what to do with the residual

The autoencoder doesn’t compress anything by itself, strictly speaking. It repackages a vector into a pair (p, V_resid) — latent code plus residual. Same information, different shape. To get actual compression, the residual has to be quantized.

This is where poly-AE’s anisotropy-removing property matters. On DBpedia-OpenAI at d=100, cond(V_resid) drops from 72 to 3.4 — almost isotropic residual. Linear PCA at the same d gives cond=4.21 — noticeably less isotropic.
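The post doesn't spell out its cond metric; a natural proxy is the ratio of the largest to smallest singular value of the centered residual matrix. A sketch under that assumption:

```python
import numpy as np

def anisotropy_cond(X):
    """Assumed metric: ratio of the largest to the smallest singular
    value of the centered data matrix (the post's exact definition
    is unstated)."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return s[0] / s[-1]
```

An isotropic Gaussian cloud gives a value close to 1; stretching one axis inflates it, matching the 72 → 3.4 drop reported above in spirit.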

Google Research recently published TurboQuant — a quantizer with a fixed random rotation that on isotropic distributions gets close to the theoretical compression limit (Shannon rate-distortion for normals). The composition is natural: the autoencoder strips out the anisotropy, TurboQuant efficiently quantizes the now-isotropic residual.

V  ──►  encoder (PCA)  ──►  p

        decoder (poly+Ridge)  ──►  V̂

V  −  V̂  ──►  V_resid  ──►  TurboQuant  ──►  compact code

For Qdrant 1.18 I shipped a TurboQuant modification with anisotropy compensation. It achieves roughly the same recall as the “poly-AE + residual quantization” hybrid, but without needing to store the poly-AE latent coordinates — saving the p-side of the per-vector byte budget. The tricks I layered on top of standard TurboQuant will be in a separate post.

9. Where the method came from and where it’s already used

I’ve been familiar with the polynomial lift since my time at ACD/Labs — in cheminformatics it’s a standard tool for building features from molecular structure (as input for downstream regression). Currently I work at Qdrant. When I was implementing late-interaction retrieval support there (ColBERT-style multi-vectors with maxsim, v1.10), I remembered the trick — and that’s how the “PCA + quadratic decoder” idea came together. I thought no one had tried it for neural embeddings yet.

When the draft went to Sava Kalbachou for review, he pointed out specific papers where this exact construction has long been used in numerical modeling of physical systems, under the name quadratic manifold (the Jain 2017, Geelen–Willcox, and Schwerdtner–Peherstorfer papers cited in the introduction).

In those papers the construction is applied to solution-trajectory snapshots of physical-system simulations. Embeddings are a different data geometry, and it’s not obvious in advance that quadratic manifold should also work there; the numbers in §3 show that it does.

Why this combination doesn’t seem to be in mainstream ML

These are just speculations. The communities probably don’t overlap much: quadratic-manifold work appears in engineering journals and rarely surfaces in ML research.

Also, in ML’s collective memory polynomial methods are mostly associated with the kernel trick (SVM, kernel PCA) — a different construction, and the explicit polynomial lift as a stand-alone technique rarely comes up after that.

And an “autoencoder” is usually reflexively understood as a neural network in modern ML; closed-form variants may simply fall outside the usual field of attention.

So this is an applied-ML post: a technique developed in an adjacent field turns out to be useful for embeddings as well. The contribution of the post is the port and the empirical check.

10. Limits of the method

A few honest caveats:

  • Cubic Ridge solve. At d=256, M = 33K, and np.linalg.solve in closed form costs O(M³) ≈ 5–15 minutes on CPU. At d=384 that’s already M = 74K and is unmanageable. The method is comfortable up to d ≈ 200–256. For larger d you’d need randomized methods (random feature approximation), or move to a neural decoder.
  • Transductive fit. PCA + Ridge are computed on the corpus we intend to compress. If the corpus drifts over time, decompression degrades. Fine for static indices, problematic for streaming.
  • Small N. When N < 5·M (i.e. N ≲ 200K at d=256), Ridge starts memorizing the corpus: in-sample retrieval improves due to overfitting rather than structure. The signal: a poly-vs-PCA gap that looks too large probably means overfit.
  • Degree 3 doesn’t work yet. A degree-3 lift gives M = O(d³) — for d=100 that’s 175K features. Fits in memory, but the Ridge solve is impractical and overfitting climbs.

11. What to try next

Obvious branches:

  • Bigger corpora, heavier models. All measurements in the post are on FiQA with models in the 305–560M-param range. Worth running on MS MARCO (8.8M passages) with e5-mistral-7b-class models. The cubic Ridge becomes the bottleneck — you’d need random-feature approximation.
  • Lift degree. At what d/N does degree 3 start to actually help? Are there embeddings with significant 3rd-order tensor structure that we’re losing now?
  • The regime where you can’t compute matryoshka. Multi-tenant SaaS, edge inference — there poly-AE without PCA won’t work either (it needs corpus stats). But a hybrid: matryoshka in the model + polynomial decoder on the client — should give the best of both worlds.

Code and reproduction

Loaders for BEIR (SciFact, FiQA, NFCorpus, TREC-COVID), encoders (nomic-v1.5, mxbai-large, bge-base, e5-base), PCA-init with per-axis std, polynomial lift, Ridge OLS, retrieval with NDCG@10/Recall@10 — all in github.com/IvanPleshkov/poly-autoencoder.

Minimal run:

git clone https://github.com/IvanPleshkov/poly-autoencoder.git
cd poly-autoencoder
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt pytrec_eval einops
python beir_eval.py --model nomic-embed-text-v1.5 --dataset fiqa --d 128,256

On an M-series MacBook the first run downloads the model and dataset, encodes the corpus (5–15 min depending on the model), and over the next 5–20 minutes prints a table with NDCG@10 for all four methods.
