Tom Turney, Founder & CTO, PsyGuard.AI

This week we achieved something we have been building toward for the past week. We compressed large AI model weights by 28-42% with minimal quality loss. No retraining. No calibration data. No special hardware. Take an existing model, run one command, and get a smaller version that produces nearly identical output.

This is not a paper. This is running code with real results, tested by real people on real hardware, right now.

The Results

For this V1 release, we tested six models across three model families. These are not projections.

  • 27 billion parameter model: 26.6 GB down to 19.1 GB. Quality loss: 1.3%.
  • 72 billion parameter model: 72 GB down to 45.8 GB. Quality loss: 3.9%.
  • 14 billion parameter model: 14.5 GB down to 9.3 GB. Quality loss: 1.0%. Actually runs 2.5x faster.
  • 70 billion parameter model: 69.8 GB down to 40.2 GB. Beats the existing standard compression at the same size.

When combined with our KV cache compression work from last week, a 35 billion parameter model running a long conversation uses 59% of the memory it used to. That is not incremental. That is a step change. The model is smaller and it has more room to think. Both compressions stack, so you are not choosing between a smaller model and more conversation memory. You get both.
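To see how the two compressions stack, here is a back-of-envelope calculation. The weight and KV cache sizes and the individual compression ratios below are illustrative assumptions chosen to land near the reported 59% figure, not measured values from the project:

```python
# Illustrative sketch: how two independent compressions stack.
# All numbers here are assumptions for demonstration; only the
# headline "59% of original memory" figure comes from the post.

def total_memory_gb(weights_gb, kv_gb, weight_ratio, kv_ratio):
    """Memory after compressing weights and KV cache independently."""
    return weights_gb * weight_ratio + kv_gb * kv_ratio

# Hypothetical 35B-parameter model in a long conversation:
weights = 35.0   # GB of weights (assumed ~1 byte/param storage)
kv_cache = 20.0  # GB of KV cache at long context (assumed)

before = weights + kv_cache
after = total_memory_gb(weights, kv_cache, weight_ratio=0.64, kv_ratio=0.5)

print(f"before: {before:.1f} GB, after: {after:.1f} GB "
      f"({after / before:.0%} of original)")
```

Because the two techniques act on different memory pools, their savings multiply out independently: compressing weights does not reduce how much the KV cache can be compressed, and vice versa.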

Within hours of publishing the code, a community member tested it on dual NVIDIA 4090 GPUs. It worked. Another tested it on an AMD GPU. It worked there too, and the compressed model actually ran 30% faster on AMD than the original, because the smaller size means less data to move through memory.

The Problem

There is a growing movement of people and organizations running AI locally. On their own hardware. Not because cloud AI is bad, but because privacy matters, costs add up, and not everyone wants their data on someone else's servers.

The problem is simple. Capable AI models are too big. A 70 billion parameter model needs over 70 GB of memory. Most hardware does not have that. So people either pay for cloud inference or settle for smaller, less capable models.
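The memory math is simple arithmetic: parameter count times bytes per parameter. The sketch below assumes 8-bit storage (1 byte per parameter), which matches the "70 billion parameters needs over 70 GB" figure above; half-precision (fp16) storage would double it:

```python
# Rough memory estimate for model weights alone (excludes KV cache,
# activations, and framework overhead). Assumes dense storage.

def weight_memory_gb(params_billions, bytes_per_param):
    """1e9 parameters at N bytes each is approximately N * params_billions GB."""
    return params_billions * bytes_per_param

print(f"{weight_memory_gb(70, 1):.0f} GB")     # 8-bit storage: 70 GB
print(f"{weight_memory_gb(70, 2):.0f} GB")     # fp16 storage: 140 GB
print(f"{weight_memory_gb(70, 0.57):.0f} GB")  # after ~42% compression: ~40 GB
```

The 0.57 bytes-per-parameter line is an assumed effective rate that corresponds to the top end of the 28-42% compression range reported above.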

This week, that equation changed. A 70 GB model is now 40 GB. A 35 GB model at full conversation length is now 23 GB. Models that did not fit now fit. On hardware people already own.

How We Got Here

Last week we open sourced a technique for compressing the working memory of AI models, the part that grows as conversations get longer. That project attracted 50+ contributors testing across Apple, NVIDIA, and AMD hardware. 5,100 GitHub stars. People found bugs, ported code to new platforms, and validated results on models we never had access to. Businesses have already started using it in production. LocalAI, an open source inference platform with 44,000+ GitHub stars, officially integrated our compression into their recommended stack.

The weight compression breakthrough this week builds on everything we learned from that work. The same mathematical principles that compress working memory also compress the model weights themselves. We discovered that different parts of the model have dramatically different sensitivity to compression: attention layers, feed-forward layers, and boundary layers all respond differently. By compressing each part according to its actual sensitivity instead of applying the same compression everywhere, we get better quality at smaller sizes than uniform approaches.
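A minimal sketch of the idea in Python: assign heavier quantization to layer types that tolerate it and keep sensitive layers at higher precision. The layer categories and bit-widths below are illustrative assumptions, not the project's actual recipe:

```python
# Sensitivity-aware bit allocation (illustrative sketch).
# Bit-widths per layer type are assumptions for demonstration only.

BITS_BY_SENSITIVITY = {
    "embedding": 8,     # boundary layers: most sensitive, lightest compression
    "attention": 6,     # moderately sensitive (assumed)
    "feed_forward": 4,  # most tolerant, heaviest compression (assumed)
}

def plan_quantization(layer_names):
    """Map each layer name to a bit-width based on its type keyword."""
    plan = {}
    for name in layer_names:
        for kind, bits in BITS_BY_SENSITIVITY.items():
            if kind in name:
                plan[name] = bits
                break
        else:
            plan[name] = 8  # unknown layers default to the safe setting
    return plan

layers = ["embedding.weight", "layer0.attention.q", "layer0.feed_forward.up"]
print(plan_quantization(layers))
# → {'embedding.weight': 8, 'layer0.attention.q': 6, 'layer0.feed_forward.up': 4}
```

The point of the non-uniform plan is that average bits per parameter drops well below 8 while the layers that most affect output quality stay close to full fidelity.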

The weight compression work moved fast because the KV cache infrastructure and community were already in place.

What PsyGuard.AI Is Doing With This

PsyGuard runs AI-simulated focus groups. Our system creates psychologically-profiled participants, runs structured discussions, and delivers insights reports in 24 hours instead of the weeks and thousands of dollars that traditional focus groups cost.

Compression unlocks three things we are building toward.

Smarter participants. Larger models produce more nuanced, more consistent simulated participants. When a 72 billion parameter model fits where it previously did not, the quality of every focus group improves.

Longer conversations. Focus groups are only useful if the AI remembers the entire discussion. Combined weight and memory compression means dramatically longer conversations without losing context.

True privacy. Healthcare, finance, and consumer research organizations handle data that cannot leave their walls. Running models locally, without sending anything to external servers, is not optional for them. Compression makes that feasible on realistic hardware.

Why We Open Source This

The community makes it better faster than we could alone. In three weeks, contributors found bugs we would not have caught for months, ported our code to platforms we do not develop on, and validated quality on dozens of models. One contributor ran it in production on datacenter GPUs and found a configuration bug that only appears on a specific model architecture. That kind of testing is impossible for a single company.

Every improvement they make benefits PsyGuard. Every improvement we make benefits them. This is not charity. This is how serious infrastructure gets built.

Where This Goes

The code is public. The results are real. We are gathering community validation across more model families and hardware configurations before merging into the main codebase. This is how we ship. Test in the open, fix what breaks, merge when it is proven.

A year ago, running a 70 billion parameter model locally required specialized hardware. Today it runs on a single consumer GPU. The gap between what fits on your hardware and what is useful is closing fast. We are building PsyGuard.AI for that future, and we are bringing the community with us.


The weight compression code, research paper, and step-by-step testing instructions are on GitHub. If you have a GPU and five minutes, we would appreciate your help testing.

PsyGuard.AI offers AI-powered virtual focus groups at 80%+ less cost than traditional methods, with 24-hour turnaround. Learn more at psyguard.ai.