@neuralmagic
Neural Magic
SparseGPT builds a pruning mask, sets the weights outside the mask to 0, and adjusts the remaining weights to compensate for the error introduced by pruning. In this article, we explore the internal workings of SparseGPT in more detail.
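
To make the mask step concrete, here is a minimal NumPy sketch of applying a binary pruning mask to one layer's weights. The matrix W and mask M are made-up toy values, and the sketch skips the part where SparseGPT adjusts the surviving weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense weight matrix for a single layer (toy values).
W = rng.normal(size=(8, 8)).astype(np.float32)

# Binary pruning mask: 1 = keep the weight, 0 = prune it.
# Here we keep a random half of the entries just to show the mechanics.
M = (rng.random(W.shape) < 0.5).astype(np.float32)

# Weights outside the mask become 0; weights inside the mask keep their value.
W_pruned = W * M

print("layer sparsity:", float((W_pruned == 0).mean()))
```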

@neuralmagic
Neural Magic
Deploying GPT-175B requires 5 A100 80GB GPUs at roughly $15,000 each. That's $75,000 in hardware just for inference 💰 You can shrink the model by removing 50% of its weights with virtually no loss in accuracy 🤯 Let's explore how to do that with the #SparseGPT algorithm. -A quick thread- 🧵👇
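
The back-of-the-envelope math behind those numbers, assuming the weights are stored in FP16 (2 bytes per parameter); GPU memory and price are the figures quoted above:

```python
import math

params = 175e9          # GPT-175B parameter count
bytes_per_param = 2     # FP16 weights (assumption; other dtypes change the math)
gpu_memory_gb = 80      # A100 80GB
gpu_price_usd = 15_000  # per-GPU price quoted above

weights_gb = params * bytes_per_param / 1e9         # ~350 GB of weights alone
gpus_needed = math.ceil(weights_gb / gpu_memory_gb)
print(f"{weights_gb:.0f} GB of weights -> {gpus_needed} GPUs -> ${gpus_needed * gpu_price_usd:,}")
# 350 GB of weights -> 5 GPUs -> $75,000 (before activations, KV cache, and overhead)
```
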
@neuralmagic
Neural Magic
SparseGPT is a post-training pruning method for compressing #LLMs like GPT-3. It prunes LLMs in one shot with minimal accuracy loss, taking, for example, OPT-175B to 50% #sparsity. And the bigger the model, the larger the proportion of weights you can prune.
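
To make "one-shot, 50% unstructured sparsity" concrete, here is a hedged sketch that zeroes the smallest-magnitude half of each layer's weights. The magnitude criterion is a simplified stand-in: SparseGPT instead selects and updates weights by solving a layer-wise reconstruction problem on calibration data.

```python
import numpy as np

def one_shot_prune(weights: dict, sparsity: float = 0.5) -> dict:
    """Zero out the smallest-magnitude `sparsity` fraction of each layer's weights.

    Simplified, magnitude-based stand-in for SparseGPT's selection rule.
    """
    pruned = {}
    for name, W in weights.items():
        threshold = np.quantile(np.abs(W), sparsity)  # per-layer cut-off
        mask = np.abs(W) >= threshold                 # keep the largest weights
        pruned[name] = W * mask
    return pruned

# Toy "model": two hypothetical layers with random weights.
rng = np.random.default_rng(0)
model = {"layer0": rng.normal(size=(16, 64)), "layer1": rng.normal(size=(64, 16))}
for name, W in one_shot_prune(model, sparsity=0.5).items():
    print(name, "sparsity:", float((W == 0).mean()))
```
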
@neuralmagic
Neural Magic
100 billion weights from models of this size can be ignored at inference time, thanks to SparseGPT. This increases the model's throughput while reducing latency, and the resulting smaller model is far more economical to deploy.
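
A small illustration of why the zeroed weights can be ignored: stored in a compressed sparse format (SciPy's CSR here), a matrix-vector product only visits the non-zero weights, so at 50% sparsity roughly half of the multiply-adds disappear. Turning that into real latency and memory wins takes sparsity-aware inference kernels; this toy example just shows the principle.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Hypothetical layer weights at 50% unstructured sparsity
# (the zeros stand in for the weights SparseGPT pruned away).
W = rng.normal(size=(1024, 1024)).astype(np.float32)
W[rng.random(W.shape) < 0.5] = 0.0
x = rng.normal(size=1024).astype(np.float32)

W_sparse = csr_matrix(W)  # stores only the non-zero weights
y_sparse = W_sparse @ x   # each multiply-add touches a non-zero weight only
y_dense = W @ x

print("max |dense - sparse|:", float(np.abs(y_dense - y_sparse).max()))
print(f"{W_sparse.nnz} of {W.size} weights are non-zero "
      f"-> ~{W_sparse.nnz / W.size:.0%} of the dense multiply-adds remain")
```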