Deploying a GPT-175B model requires 5 A100 80GB GPUs, each costing about $15,000.
That's $75,000 for inference 💰
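Where do those 5 GPUs come from? A quick back-of-the-envelope sketch (assuming FP16 weights only; activations and the KV cache need memory on top of this):

```python
# Back-of-the-envelope GPU count for serving a 175B-parameter model.
# Assumes FP16 weights (2 bytes each); real deployments need extra
# memory for activations, KV cache, and framework overhead.
PARAMS = 175e9          # 175 billion parameters
BYTES_PER_PARAM = 2     # FP16
GPU_MEMORY_GB = 80      # A100 80GB

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9    # 350 GB of weights alone
gpus_needed = -(-weights_gb // GPU_MEMORY_GB)  # ceiling division

print(weights_gb, gpus_needed)  # 350.0 5.0
```

At ~$15,000 per card, that is the $75,000 price tag for a single inference replica.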
You can reduce the model’s size by removing 50% of the weights with negligible loss in accuracy 🤯
Let's explore how to do that with the #SparseGPT algorithm.
A quick thread 🧵👇
SparseGPT is a post-training pruning method for compressing #LLMs like GPT-3.

It can prune an LLM in one shot, with minimal accuracy loss: for example, OPT-175B can be pruned to 50% #sparsity.
With SparseGPT, you can prune a larger proportion of the weights as the model gets bigger.
That means on the order of 100 billion weights of the large language model can be ignored at inference time, thanks to SparseGPT.
This increases the model's throughput while reducing latency.
Removing these weights naturally leads to a smaller model that is far more economical to deploy.
SparseGPT computes a pruning mask: weights outside the mask are set to 0, while the remaining weights are kept (and updated to compensate for the ones that were removed).
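To make the masking idea concrete, here is a minimal toy sketch that builds a 50% mask by weight magnitude. Note this is an illustration only: SparseGPT itself selects the mask with a second-order (Hessian-based) criterion and adjusts the surviving weights, rather than just zeroing the smallest ones.

```python
import numpy as np

# Toy illustration of a 50% pruning mask (magnitude-based here;
# SparseGPT picks the mask with a more sophisticated criterion
# and updates the remaining weights to compensate).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # a small weight matrix

k = W.size // 2                      # keep the top 50% by magnitude
threshold = np.sort(np.abs(W), axis=None)[-k]
mask = np.abs(W) >= threshold        # True = keep, False = prune

W_pruned = np.where(mask, W, 0.0)    # zero out weights not in the mask

print(f"sparsity: {(W_pruned == 0).mean():.0%}")  # sparsity: 50%
```

With a 2:4 or similar structured sparsity pattern, those zeros translate into real speedups on hardware that supports sparse execution.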
In this article, we explore the internal workings of SparseGPT in more detail.
@neuralmagic
Another game changer - LLMs will be everywhere, running on every device, in a few years... (they are already running on RasPis)
HYYYYPEEEEE