Steven A. Thompson

WIZORB: The Search for the Universal Subspace & LLM Compression

project_wizorb

The Story

I learn by doing, so I decided to learn LLMs by figuring out how to compress one by vibe coding real implementations of things I found in research papers.

For the first challenge, I decided to investigate the Universal Weight Subspace Hypothesis (UWSH) and try to determine empirically whether it holds. The paper's authors have had a "Code Releasing Soon" badge on their GitHub for way too long, in my decidedly non-academic opinion. My code is on GitHub now.

The hypothesis implies that the weight updates of Large Language Models (LLMs) trained on different tasks naturally converge to a shared, low-dimensional subspace. If true, this would also imply that transformer LLMs are geometrically sub-optimal and could be massively compressed.

My data indicates the Strong UWSH (that models naturally converge to this subspace) is false. However, a Weak UWSH (that a subspace exists where the model can live) is true, and we can use it to do real work.

I initially suspected that the "Universal Basis" found in literature might just be a Johnson-Lindenstrauss (JL) projection—essentially, that any random projection would work if the dimensions were high enough. My final control tests proved this wrong. I compared my extracted basis against a purely random one. The random basis failed to improve the model significantly, while the extracted basis provided a statistically significant boost. This implies there is a "Ghost in the Machine" - a hidden functional structure specific to tasks that we can extract. I suspect that this will hold true for any task, but can only empirically prove that it worked for coding.

The most important discovery was the "Projection Paradox." I found that I could project a model into a subspace where it lost 99% of its geometric information (the weights looked "wrong" mathematically), yet it retained 96% of its functional intelligence (it still wrote code perfectly). This implies that "Weight Geometry" and "Model Function" are decoupled in ways we didn't fully appreciate due to the massive degeneracy (redundancy) in neural networks.

I'm working on the next stage now. I plan to use this "Projection Paradox"—specifically the discovery that we can "repair" broken geometry using coefficient training—to project very large LLMs into much smaller ones, effectively compressing them by 40-50% while preserving functionality. It will take a while. I only have a single GB10-based computer to work with, and I have a real job that pays me to do real work. :)

The GitHub link with the actual code, data, reproducibility steps, etc. is here.

NOTE 1: This finding strongly resonates with the Lottery Ticket Hypothesis, but suggests that "winning tickets" might be retrievable via projection and repair rather than just iterative pruning.

NOTE 2: Most of this post was written or at least edited by an AI. The AI was instructed to summarize my data in the form of an academic paper, and to expand upon my own cursory notes because "ain't nobody got time for that."™️ I have a lot of data, and not a lot of time. I did at least proofread all of it. If you don't like that an AI assisted here, you're entitled to a full refund of everything you paid to read this.

---

Abstract

I ran Project WIZORB as a one-person effort to test (and, if possible, operationalize) the Universal Weight Subspace Hypothesis (UWSH) for large language models: the claim that across many fine-tunes of the same architecture, task-relevant weight updates concentrate into a small shared subspace, enabling compression into a universal basis plus small per-task coefficients.

Using LoRA deltas as the update representation, I trained a delta bank of 20 coding adapters on shards of nvidia/OpenCodeInstruct, extracted a per-module universal basis via a PCA/HOSVD-like procedure, and attempted coefficient-only training over a frozen basis to improve coding benchmarks (HumanEval+ and MBPP+). This direct training approach failed repeatedly: learned coefficient adapters did not reliably outperform the base model, and basis diagnostics showed extreme holdout reconstruction failure (relative L2 error ≈ 1.0).

I then pivoted to a representation-vs-optimization test: instead of learning coefficients from scratch, I projected a known-good LoRA solution into the extracted basis. This revealed what I call the Projection Paradox: the projected solution had ~98–99% relative L2 error (geometrically “orthogonal” to the target weights), yet it preserved nearly all functional uplift on coding benchmarks. With appropriate statistical power and paired testing (McNemar), I established a minimum functional rank of k_min ≈ 64 for this setup.

Crucially, I falsified the "it’s just a random projection" hypothesis using random and secondary-subspace controls. At rank 64, the extracted Universal Basis yielded statistically significant uplift (p=0.0195) while the Random Basis did not (p=0.125).

Overall, my results disconfirm the strong form of UWSH—at least as “geometrically recoverable from a small adapter bank” in this domain—while supporting a weaker claim: a small functional subspace exists, but independently trained solutions do not appear to converge to it geometrically, and using it requires explicit re-alignment (“repair”) rather than naïve variance-based truncation.

---

1. Introduction

Fine-tuning large LLMs is operationally expensive: storing many full checkpoints is costly, and even parameter-efficient fine-tunes (e.g., LoRA) create a proliferation of adapters. The Universal Weight Subspace Hypothesis is appealing because it promises:

  1. a shared low-dimensional basis that captures most task-relevant variation across fine-tunes, and
  2. cheap per-task adaptation via small coefficient vectors instead of new weight matrices.

WIZORB was my attempt to validate this hypothesis under a concrete applied target: improving coding performance of a frozen LLaMA-3.1-8B base model via “universal basis + coefficient training.”
I designed WIZORB explicitly as a falsifiable ladder of experiments.

The project evolved substantially because early results contradicted the strong UWSH expectations, forcing me to separate representation capacity from optimization difficulty.

---

2. Background and hypothesis

2.1 Universal subspace framing

Let a base model have parameters θ_0 ∈ ℝ^D. A task adaptation yields parameters θ_i = θ_0 + Δθ_i.

A common “universal subspace” claim can be written as:

Δθ_i ≈ Σ_{k=1}^{K} c_{i,k} b_k,  with K ≪ D

where {b_k}_{k=1}^{K} is a shared basis, and c_{i,k} are task-specific coefficients.

In WIZORB I instantiated Δθi as LoRA deltas applied to a subset of linear layers, and I attempted to learn the c’s for a coding task while keeping the basis fixed.
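The storage claim above can be made concrete with a minimal numpy sketch. The dimensions here are made up for illustration (nothing close to the real 8B parameter count), and the orthonormal basis is random, since the point is only the bookkeeping: one shared basis, plus K coefficients per task instead of D raw weights.

```python
import numpy as np

# Illustrative sizes only (not the real model dimensions).
D, K = 10_000, 64
rng = np.random.default_rng(0)

# Shared orthonormal basis: columns are the b_k.
B, _ = np.linalg.qr(rng.standard_normal((D, K)))

# Task-specific coefficients c_{i,k} for one hypothetical task.
c = rng.standard_normal(K)

# Under the hypothesis, the task delta is (approximately) a basis combination.
# Storing the task costs K floats instead of D.
delta = B @ c  # Δθ_i ≈ Σ_k c_{i,k} b_k
assert delta.shape == (D,)
```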

2.2 LoRA parametrization (what I actually used)

For a linear layer W_0 ∈ ℝ^{d_out × d_in}, LoRA applies:

W = W_0 + (α/r)·(B A)

with rank r, A ∈ ℝ^{r × d_in}, B ∈ ℝ^{d_out × r}, and scaling α.

WIZORB’s “universal-basis LoRA” does not train A,B directly. Instead it reconstructs them from a fixed basis and trains small coefficient tensors.
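For readers less familiar with LoRA, here is the parametrization from 2.2 in a few lines of numpy. Sizes are illustrative, and the zero-init of B (which makes the adapter start as a no-op) follows the standard LoRA recipe, not anything specific to WIZORB.

```python
import numpy as np

# Minimal LoRA sketch with illustrative sizes.
d_in, d_out, r, alpha = 512, 512, 16, 32
rng = np.random.default_rng(1)

W0 = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection, zero init

W = W0 + (alpha / r) * (B @ A)                  # effective weight

# With B initialized to zero, the adapter starts as an exact no-op.
assert np.allclose(W, W0)
```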

---

3. Experimental setup

3.1 Model, data, and evaluation

3.2 Target modules

I targeted the common transformer linear modules (the attention and MLP projection layers).

3.3 Delta bank (Phase 2)

I trained 20 LoRA adapters (rank r=16), each on a 500-sample shard of OpenCodeInstruct, to create a “delta bank.” The intent was to provide enough variation to estimate a shared subspace.
(In hindsight, this bank was likely too small and too homogeneous to satisfy the strong UWSH assumptions.)

---

4. Basis extraction method (what “HOSVD” meant in practice)

4.1 Canonicalizing LoRA rank factors

Raw LoRA factors have non-identifiabilities (scales and rotations inside the rank dimension). I partially canonicalized each adapter update by computing an SVD of the implied delta:

ΔW = B A = U S Vᵀ

Then for each singular component j, I formed “rank vectors”:

a_j = √(S_j)·V_j,  b_j = √(S_j)·U_j

and applied a scale-invariant normalization:

s = √(‖a_j‖ / ‖b_j‖),  a_j ← a_j / s,  b_j ← b_j · s

Finally I concatenated:

v_j = [a_j ; b_j] ∈ ℝ^{d_in + d_out}
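The canonicalization steps above can be sketched end to end in numpy. This is my reading of the procedure, not the repository's exact code: the √S split and the norm-balancing follow the equations as written, and the function name and dimensions are mine.

```python
import numpy as np

def canonicalize(B, A):
    """Turn one LoRA update ΔW = B @ A into scale-balanced rank vectors.
    B has shape (d_out, r); A has shape (r, d_in)."""
    dW = B @ A
    U, S, Vt = np.linalg.svd(dW, full_matrices=False)
    r = B.shape[1]
    vecs = []
    for j in range(r):  # only the first r singular components are nonzero
        a = np.sqrt(S[j]) * Vt[j]     # a_j = sqrt(S_j) * V_j
        b = np.sqrt(S[j]) * U[:, j]   # b_j = sqrt(S_j) * U_j
        s = np.sqrt(np.linalg.norm(a) / np.linalg.norm(b))  # balance sides
        a, b = a / s, b * s
        vecs.append(np.concatenate([a, b]))  # v_j in R^{d_in + d_out}
    return np.stack(vecs)

rng = np.random.default_rng(2)
rank_vecs = canonicalize(rng.standard_normal((64, 16)),
                         rng.standard_normal((16, 48)))
```

Note that after balancing, ‖a_j‖ = ‖b_j‖ for every j, which removes the per-component scale ambiguity between the two factors.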

4.2 PCA/HOSVD-style basis per module

For each module, I stacked all vj from all adapters into a matrix X and computed an SVD of the centered data:

X_c = X − μ,  X_c = U Σ Vᵀ

The “universal basis” for that module was the top-k right singular vectors (the top-k rows of Vᵀ), together with the mean μ.
I experimented with more than one extraction variant; most later phases used the Separate method.
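The per-module extraction reduces to a few lines of numpy. This is a sketch under my assumptions: the function name is mine, and the 320 × 112 input stands in for 20 adapters × 16 rank vectors of a small hypothetical module.

```python
import numpy as np

def extract_basis(vectors, k):
    """PCA-style basis for one module: stack rank vectors from all adapters,
    center, SVD, keep the top-k right singular vectors as basis rows."""
    X = np.asarray(vectors)          # (n_vectors, dim)
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mu, Vt[:k]                # mean and (k, dim) basis

rng = np.random.default_rng(3)
mu, C = extract_basis(rng.standard_normal((320, 112)), k=64)
assert C.shape == (64, 112)
```

Because the rows of C come from an SVD, they are orthonormal, which matters later: it is what makes the Phase 9–10 projection a true orthogonal projection.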

---

5. Universal-basis coefficient adapter (training target)

In my implementation, each rank row is reconstructed as:

V_j = μ_V + c_j^{(A)} C_V,  U_j = μ_U + c_j^{(B)} C_U

and the effective delta is:

ΔW = Uᵀ V  (stacking the U_j and V_j as rows of U and V)

Only the coefficients c_j^{(A)}, c_j^{(B)} are trainable. For r=16 and k=32, this yields 229,376 trainable parameters total across the model (≈0.0028% of 8B).
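The reconstruction for one module can be sketched as follows. Dimensions are illustrative and the variable names are mine; in the real setup the basis and means are frozen and only cA, cB receive gradients.

```python
import numpy as np

# Illustrative dims: LoRA rank r=16, basis size k=32, a 512x512 layer.
r, k, d_in, d_out = 16, 32, 512, 512
rng = np.random.default_rng(4)

# Frozen per-module basis rows and means.
C_V, mu_V = rng.standard_normal((k, d_in)), rng.standard_normal(d_in)
C_U, mu_U = rng.standard_normal((k, d_out)), rng.standard_normal(d_out)

# The only trainable tensors: 2 * r * k coefficients per module.
cA = np.zeros((r, k))
cB = np.zeros((r, k))

V = mu_V + cA @ C_V   # (r, d_in):  reconstructed A-side rank rows
U = mu_U + cB @ C_U   # (r, d_out): reconstructed B-side rank rows
dW = U.T @ V          # effective ΔW, shape (d_out, d_in)
```

As a sanity check on the quoted total: 2·r·k = 1,024 coefficients per module, and 229,376 / 1,024 = 224 modules, consistent (by my arithmetic, not a claim from the source) with 7 linear modules per layer across 32 layers of an 8B LLaMA.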

---

6. Results by phase (and what each phase taught me)

Phase 0–1: Baseline and LoRA control (sanity)

On HumanEval+ (full, strict plus):

Model pass@1_base pass@1_plus (strict)
Base 37.20% (61/164) 30.49% (50/164)
LoRA control 40.85% (67/164) 36.59% (60/164)

The LoRA control uplift was real (+10 strict-plus tasks), but on n=164 it was borderline by McNemar (p≈0.087). This foreshadowed that power would matter.
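Since McNemar tests carry the statistical weight in every later phase, here is a sketch of the exact paired test on per-task pass/fail vectors. It uses `scipy.stats.binomtest` (SciPy ≥ 1.7); the function name is mine, and the demo vectors are toy data, not the real benchmark results.

```python
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(pass_a, pass_b):
    """Exact McNemar test on paired per-task pass/fail vectors.
    Only discordant tasks (exactly one model passes) carry signal."""
    pass_a = np.asarray(pass_a, dtype=bool)
    pass_b = np.asarray(pass_b, dtype=bool)
    b = int(np.sum(pass_a & ~pass_b))   # A passes, B fails
    c = int(np.sum(~pass_a & pass_b))   # B passes, A fails
    if b + c == 0:
        return 1.0                      # no discordant pairs: no evidence
    return binomtest(b, n=b + c, p=0.5, alternative="two-sided").pvalue

# Toy demo: the second model flips two fails into passes.
base_pass = [True, False, True, False, False]
lora_pass = [True, True, True, True, False]
p = mcnemar_exact(base_pass, lora_pass)
assert 0.0 < p <= 1.0
```

The key property for this project: because the test only counts discordant tasks, a fixed uplift (e.g. +10 tasks) can be borderline on n=164 but significant once the paired pool grows, which is exactly why the later phases aggregate HumanEval+ with MBPP+.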

---

Phase 4–6: Coefficient-only training over a frozen extracted basis (failure)

I trained coefficients on 2,000 OpenCodeInstruct samples (LR 1e-4, 1 epoch). The key outcome: no robust coding uplift. My Phase 5 ablations (universal vs random vs secondary) did not support the hypothesis at k=32, and increasing to k=64 did not rescue it.

Separately, I ran basis quality diagnostics and found the most damning signal: holdout reconstruction was essentially zero even when train reconstruction improved.
For example, at k=64 the holdout relative L2 error remained ≈1.0 even as train error dropped.

This was my first major contradiction with the strong UWSH expectation (“shared low-rank structure should generalize across adapters”).

---

Phase 7–8: Rank sweeps and “overfit to the bank” confirmation

I performed rank sweeps up to the theoretical maximum implied by the bank (20 adapters × rank 16 = 320 rank vectors). At every rank, train reconstruction kept improving while holdout reconstruction stayed near total failure. At k=256:

k Train error (rel L2) Holdout error (rel L2)
256 0.0001 0.967

This strongly suggested my extracted basis was learning idiosyncratic bank structure, not a transferable “coding subspace.”
At this point, the straightforward “universal basis + train coefficients” story looked dead.

---

Phase 9–10: Pivot to projection (representation vs optimization)

I realized I hadn’t cleanly separated two possibilities:

  1. the basis cannot represent a good coding solution (representation failure), vs
  2. the basis can represent it, but coefficient training can’t find it (optimization failure).

So I stopped training and started projecting.
Given a known-good LoRA adapter, I computed its normalized rank vectors and projected them onto the extracted basis:

ĉ = C (v − μ)

This is an orthogonal projection when C is orthonormal.
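The projection step is small enough to show in full. This is a sketch under my assumptions (function name and dimensions are mine): C has orthonormal rows, so projecting and reconstructing gives the closest point to v inside the affine subspace μ + span(C).

```python
import numpy as np

def project_into_basis(v, mu, C):
    """Orthogonal projection of one rank vector onto the extracted basis.
    C is (k, dim) with orthonormal rows; returns coefficients,
    the reconstructed vector, and the relative L2 error."""
    c_hat = C @ (v - mu)          # ĉ = C (v − μ)
    v_hat = mu + C.T @ c_hat      # nearest point in the affine subspace
    rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
    return c_hat, v_hat, rel_err

rng = np.random.default_rng(5)
C = np.linalg.qr(rng.standard_normal((112, 64)))[0].T  # orthonormal rows
v, mu = rng.standard_normal(112), np.zeros(112)
c_hat, v_hat, err = project_into_basis(v, mu, C)
assert 0.0 < err < 1.0
```

The paradox lives entirely in `rel_err`: for the real adapters it came out ≈0.99, yet loading the reconstructed rank vectors back into the model preserved most of the benchmark uplift.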
The shock: projection retained performance, even though geometric error was extreme.
At k=320 (built from all 20 training adapters):

HumanEval+ full:

Model pass@1_plus (strict)
Base 50/164 (30.49%)
LoRA control 60/164 (36.59%)
Projected into basis (k=320) 58/164 (35.37%)

This is the core empirical phenomenon behind what I call the Projection Paradox.

---

Phase 11–13: Finding the minimum functional rank k_min

I sliced the basis and repeated projection at smaller ranks. Pilot sweeps suggested k=64 was a “knee” where function stayed high even as geometric error remained ~0.99.
On HumanEval+ (full n=164):

Model strict-plus passes McNemar vs Base
Base 50
Projected k=32 58 p=0.1153
Projected k=64 60 p=0.0525 (borderline)

HumanEval alone still lacked power. So I increased power by aggregating HumanEval+ with MBPP+ pilots.
Aggregated (HumanEval full n=164 + MBPP pilot n=200, total n=364):

Rank Base passes Proj passes McNemar p
k=32 158 171 0.0533 (borderline)
k=64 158 174 0.0195 (significant)

This was my first statistically strong evidence that a compact functional coding subspace exists at roughly k ≈ 64.

---

Phase 14: Falsification controls (random basis and secondary components)

To test whether “any low-rank projection works,” I generated:

Projection fidelity (LoRA → basis) showed:

Basis type Mean rel. error
Universal (k=64) 0.994
Random (k=64) 0.9999
Secondary (k=64) 0.9997

Then I evaluated on the aggregated pool n=364:

Model Pass rate Uplift vs Base McNemar p (vs Base)
Base 43.41% (158/364)
Universal k=64 47.80% (174/364) +16 0.0195
Random k=64 44.51% (162/364) +4 0.1250
Secondary k=64 44.23% (161/364) +3 0.4531

This is the cleanest evidence I got that the extracted basis is not merely a Johnson–Lindenstrauss-style “lucky random projection.” At rank 64, random did something (+4 tasks), but the universal basis delivered roughly four times that uplift (+16 tasks) and was the only one to cross p<0.05 against base.
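For completeness, the two controls are easy to construct. This is my sketch of them (function names and sizes are mine): the random control is a uniformly random k-dimensional subspace via QR of a Gaussian matrix, and the secondary control just slices the next k rows of the same SVD factor Vᵀ used for the universal basis.

```python
import numpy as np

def random_control_basis(dim, k, seed=0):
    """Matched-rank random control: orthonormal rows spanning a
    uniformly random k-dimensional subspace."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
    return Q.T  # (k, dim)

def secondary_control_basis(Vt, k):
    """Matched-rank secondary control: the *next* k principal directions
    after the top-k, from the same SVD factor Vt."""
    return Vt[k:2 * k]

R = random_control_basis(112, 64)
assert R.shape == (64, 112)
```

Keeping the rank matched at k=64 is the point of the design: any uplift unique to the universal basis then has to come from *which* directions it contains, not from dimensionality alone.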

---

7. Interpretation: what I believe WIZORB established

7.1 I disconfirmed the strong form of UWSH (in my regime)

If the strong hypothesis were true in the most straightforward sense—“a small bank yields a geometrically meaningful universal subspace that new task deltas lie close to”—I would expect held-out adapters to reconstruct well in the basis, and coefficient training over it to recover LoRA-level gains.

Instead, I consistently saw holdout reconstruction errors near 1.0 and coefficient training that never robustly beat the base model.

Geometrically, the extracted basis was almost orthogonal to the LoRA solution in L2 terms. That is incompatible with a literal “shared principal directions explain most task delta geometry” story—at least with my bank size, extraction method, and data.

7.2 The Projection Paradox: functional equivalence with geometric orthogonality

Despite the geometric failure, projection preserved coding performance remarkably well, especially at k ≈ 64. Put plainly: the projected weights were ~99% “wrong” in L2 terms, yet the model still wrote code at near-LoRA level.

This suggests a sharp disconnect between weight-space geometry (L2 proximity to a known-good solution) and model function (benchmark behavior).

My working explanation is that modern networks are extremely degenerate: there exist many parameter configurations that are far apart in L2 yet functionally near-equivalent. In that framing, PCA on raw deltas can easily track “sloppy” high-variance directions that have little to do with function, while the functionally relevant structure can be recoverable even when the geometric projection looks terrible.

7.3 The universal basis is not “just random”

Phase 14 is critical here. At rank 64, only the extracted Universal Basis crossed significance against base (p=0.0195); the Random (p=0.125) and Secondary (p=0.4531) controls did not.

So I do not think WIZORB reduced to “any 64D subspace works.” Something in the extracted basis was capturing task-relevant structure.

7.4 Why coefficient training failed (my best-supported hypotheses)

WIZORB produced two seemingly contradictory facts: projecting a known-good LoRA into the basis preserves nearly all of its uplift, yet training coefficients from scratch over the same frozen basis produces no robust uplift.

This is classic “representation exists, optimization fails.” My most plausible contributing factors are:

  1. Bank homogeneity (“shard problem”)
    My delta bank came from shards of the same dataset distribution. If across-bank variance is dominated by seed/SGD noise rather than semantic task variation, PCA will happily model noise.
  2. Residual gauge freedom / misalignment
    Even with SVD-based normalization, there are still degrees of freedom across adapters that can scramble parameter geometry without changing function. A basis extraction method that ignores these symmetries can average incompatible coordinates and make gradient-based learning harder than it needs to be.
  3. The “variance explained ⇒ importance” fallacy
    My results are consistent with the idea that large-variance directions are not necessarily the directions that matter for coding behavior. If true, it undermines the strong geometric reading of UWSH while still allowing a compact functional subspace to exist.

7.5 “WIZORB repair” (where I think this goes)

My experiments suggest that using a compressed/limited subspace requires an explicit re-alignment step. I am treating this as a "repair" process: the functional subspace exists, but the geometry is broken.

I am now moving to Project ZOMBORG: instead of finding a "universal" basis from many small models, I will use Fisher-Weighted SVD to find the specific functional basis of a single large model, and then use the WIZORB coefficient training loop to "repair" the model (via distillation) into that compressed state.

---

8. Limitations

I do not claim a universal theorem-level refutation of UWSH. My strongest claims are constrained to this regime: a single base architecture (LLaMA-3.1-8B), a single domain (coding), rank-16 LoRA deltas, and a small, homogeneous 20-adapter bank drawn from shards of one dataset.

A larger, more diverse bank (different datasets, different task families) and more careful adapter alignment could plausibly change the geometric story.

---

9. Conclusion

WIZORB began as an attempt to validate a clean operational promise: extract a small universal subspace from a bank of LoRA adapters, then train only coefficients to recover coding gains. That version failed. The extracted basis did not generalize geometrically, and coefficient training failed to produce robust uplift.

However, WIZORB did uncover a more interesting—and to me, more scientifically provocative—result: a low-dimensional functional coding subspace exists in the sense that a projected solution in that subspace can preserve LoRA-level gains, even while being nearly orthogonal in weight space. With paired statistical testing and increased task power, I established k_min ≈ 64 for significant uplift in my projection setting, and I showed that the extracted basis outperforms random and secondary controls at the same rank.

My final position is that the strong UWSH (natural geometric convergence to a shared subspace) is disconfirmed in my regime, while the weak UWSH survives: a compact functional subspace exists, but exploiting it requires explicit repair rather than naïve variance-based truncation.

---

10. Reproducibility and artifacts

Code, raw results, a giant mess, etc. are all on GitHub here, including evaluation outputs and projection scripts, along with the configs used for each phase. I can't include the tensors and basis files, as they're too large. If anyone cares for them, I'll label them TOP SECRET and share them on the WarThunder forums or something. They're easy enough to reproduce with the scripts if you have a bit of compute around.