@neuralmagic
Neural Magic
SparseGPT builds a pruning mask, sets the weights outside the mask to 0, and adjusts the remaining weights to compensate for the error introduced by pruning. In this article, we explore the internal workings of SparseGPT in more detail.
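
To make the mask step concrete, here is a minimal NumPy sketch of applying a binary pruning mask to one layer's weights. The matrix W and mask M are made-up toy values, and the sketch skips the part where SparseGPT adjusts the surviving weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense weight matrix for a single layer (toy values).
W = rng.normal(size=(8, 8)).astype(np.float32)

# Binary pruning mask: 1 = keep the weight, 0 = prune it.
# Here we keep a random half of the entries just to show the mechanics.
M = (rng.random(W.shape) < 0.5).astype(np.float32)

# Weights outside the mask become 0; weights inside the mask keep their value.
W_pruned = W * M

print("layer sparsity:", float((W_pruned == 0).mean()))
```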

@neuralmagic
Neural Magic
Deploying GPT-175B requires 5 A100 80GB GPUs at roughly $15,000 each. That's $75,000 in hardware just for inference 💰 You can shrink the model by removing 50% of its weights with virtually no loss in accuracy 🤯 Let's explore how to do that with the #SparseGPT algorithm. -A quick thread- 🧵👇
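
The back-of-the-envelope math behind those numbers, assuming the weights are stored in FP16 (2 bytes per parameter); GPU memory and price are the figures quoted above:

```python
import math

params = 175e9          # GPT-175B parameter count
bytes_per_param = 2     # FP16 weights (assumption; other dtypes change the math)
gpu_memory_gb = 80      # A100 80GB
gpu_price_usd = 15_000  # per-GPU price quoted above

weights_gb = params * bytes_per_param / 1e9         # ~350 GB of weights alone
gpus_needed = math.ceil(weights_gb / gpu_memory_gb)
print(f"{weights_gb:.0f} GB of weights -> {gpus_needed} GPUs -> ${gpus_needed * gpu_price_usd:,}")
# 350 GB of weights -> 5 GPUs -> $75,000 (before activations, KV cache, and overhead)
```
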
@neuralmagic
Neural Magic
SparseGPT is a post-training pruning method for compressing #LLMs like GPT-3. It prunes LLMs in one shot with minimal accuracy loss, taking, for example, OPT-175B to 50% #sparsity. And the bigger the model, the larger the proportion of weights you can prune.
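
To make "one-shot, 50% unstructured sparsity" concrete, here is a hedged sketch that zeroes the smallest-magnitude half of each layer's weights. The magnitude criterion is a simplified stand-in: SparseGPT instead selects and updates weights by solving a layer-wise reconstruction problem on calibration data.

```python
import numpy as np

def one_shot_prune(weights: dict, sparsity: float = 0.5) -> dict:
    """Zero out the smallest-magnitude `sparsity` fraction of each layer's weights.

    Simplified, magnitude-based stand-in for SparseGPT's selection rule.
    """
    pruned = {}
    for name, W in weights.items():
        threshold = np.quantile(np.abs(W), sparsity)  # per-layer cut-off
        mask = np.abs(W) >= threshold                 # keep the largest weights
        pruned[name] = W * mask
    return pruned

# Toy "model": two hypothetical layers with random weights.
rng = np.random.default_rng(0)
model = {"layer0": rng.normal(size=(16, 64)), "layer1": rng.normal(size=(64, 16))}
for name, W in one_shot_prune(model, sparsity=0.5).items():
    print(name, "sparsity:", float((W == 0).mean()))
```
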
@neuralmagic
Neural Magic
100 billion weights from models of this size can be ignored at inference time, thanks to SparseGPT. This increases the model's throughput while reducing latency, and the resulting smaller model is far more economical to deploy.
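
A small illustration of why the zeroed weights can be ignored: stored in a compressed sparse format (SciPy's CSR here), a matrix-vector product only visits the non-zero weights, so at 50% sparsity roughly half of the multiply-adds disappear. Turning that into real latency and memory wins takes sparsity-aware inference kernels; this toy example just shows the principle.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Hypothetical layer weights at 50% unstructured sparsity
# (the zeros stand in for the weights SparseGPT pruned away).
W = rng.normal(size=(1024, 1024)).astype(np.float32)
W[rng.random(W.shape) < 0.5] = 0.0
x = rng.normal(size=1024).astype(np.float32)

W_sparse = csr_matrix(W)  # stores only the non-zero weights
y_sparse = W_sparse @ x   # each multiply-add touches a non-zero weight only
y_dense = W @ x

print("max |dense - sparse|:", float(np.abs(y_dense - y_sparse).max()))
print(f"{W_sparse.nnz} of {W.size} weights are non-zero "
      f"-> ~{W_sparse.nnz / W.size:.0%} of the dense multiply-adds remain")
```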