@neuralmagic
Neural Magic
1 year
Deploying a 175B-parameter GPT model requires 5 A100 80GB GPUs, each costing about $15,000. That's $75,000 just to run inference 💰 You can shrink the model by removing 50% of its weights with virtually no loss in accuracy 🤯 Let's explore how to do that with the #SparseGPT algorithm. -A quick thread- 🧵👇
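A rough back-of-envelope sketch of those numbers, assuming FP16 weights and counting weight memory only (ignoring activations and the KV cache):

# Why a dense 175B model needs ~5 x 80 GB GPUs, and what 50% sparsity could save.
# Assumes 2 bytes per weight (FP16); weight memory only.
import math

params = 175e9            # 175B parameters
bytes_per_weight = 2      # FP16
gpu_mem_gb = 80           # A100 80GB
gpu_price_usd = 15_000

dense_gb = params * bytes_per_weight / 1e9       # ~350 GB of weights
dense_gpus = math.ceil(dense_gb / gpu_mem_gb)    # 5 GPUs
sparse_gb = dense_gb * 0.5                       # half the weights removed
sparse_gpus = math.ceil(sparse_gb / gpu_mem_gb)  # 3 GPUs, if stored compressed

print(f"dense:  {dense_gb:.0f} GB -> {dense_gpus} GPUs (~${dense_gpus * gpu_price_usd:,})")
print(f"sparse: {sparse_gb:.0f} GB -> {sparse_gpus} GPUs (~${sparse_gpus * gpu_price_usd:,})")

The sparse figures only materialize if the zeros are actually stored and executed in a compressed form.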

Replies

@neuralmagic
Neural Magic
1 year
SparseGPT is a post-training pruning method for compressing #LLMs like GPT-3. It can prune a model in one shot with minimal accuracy loss: for example, taking OPT-175B to 50% #sparsity. And the bigger the model, the larger the proportion of weights SparseGPT can prune.
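To make "one-shot 50% unstructured sparsity" concrete, here is a minimal sketch that prunes a single linear layer by weight magnitude. This is only an illustration: SparseGPT itself chooses the mask and updates the surviving weights layer by layer using approximate second-order information.

import torch

def one_shot_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the `sparsity` fraction of smallest-magnitude weights, in place."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold          # True = keep, False = prune
    weight.mul_(mask)
    return mask

layer = torch.nn.Linear(4096, 4096)
with torch.no_grad():
    mask = one_shot_prune_(layer.weight, sparsity=0.5)

print(f"achieved sparsity: {1 - mask.float().mean():.2%}")   # ~50%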
@neuralmagic
Neural Magic
1 year
Thanks to SparseGPT, roughly 100 billion of the model's weights can be ignored at inference time. That increases throughput and reduces latency, and the resulting smaller model is far more economical to deploy.
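A hedged illustration of the storage side of that claim (our own sketch of one possible compressed layout, not the thread's actual runtime): once half the weights are exactly zero, a layer can be stored as a 1-bit mask plus the packed nonzero values instead of a dense FP16 matrix.

import numpy as np

rng = np.random.default_rng(0)
dense = rng.standard_normal((4096, 4096)).astype(np.float16)
dense[rng.random(dense.shape) < 0.5] = 0                  # ~50% of the weights zeroed

mask = dense != 0
packed_mask = np.packbits(mask)                           # 1 bit per weight position
nonzero_values = dense[mask]                              # FP16 values that survive

dense_mib = dense.nbytes / 2**20                          # ~32 MiB
compressed_mib = (packed_mask.nbytes + nonzero_values.nbytes) / 2**20   # ~18 MiB

print(f"dense layer: {dense_mib:.1f} MiB, bitmask + nonzeros: {compressed_mib:.1f} MiB")

Latency and throughput gains additionally require a runtime that can skip the zeros during the matrix multiplications, not just store them compactly.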
@neuralmagic
Neural Magic
1 year
SparseGPT builds a pruning mask: weights outside the mask are set to 0, while the remaining weights keep their values (and are then updated to compensate for the pruned connections). In this article, we explore the internal workings of SparseGPT in more detail.
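Here is a simplified, hypothetical sketch of that mask-then-update idea for a single layer, using a plain least-squares refit in place of SparseGPT's efficient Hessian-based solver (which is what lets it scale to billions of weights):

import torch

def prune_and_reconstruct(W, X, sparsity=0.5):
    # W: (out, in) weights of one linear layer; X: (n, in) calibration inputs.
    # Illustrative only: drop the smallest-magnitude weights per row, then
    # re-fit the surviving weights so the outputs on X stay close to dense.
    out_features, in_features = W.shape
    k = int(in_features * sparsity)
    mask = torch.ones_like(W, dtype=torch.bool)
    smallest = W.abs().argsort(dim=1)[:, :k]
    mask.scatter_(1, smallest, False)                 # False = pruned to 0
    target = X @ W.T                                  # dense layer outputs
    W_new = torch.zeros_like(W)
    for i in range(out_features):
        keep = mask[i]
        sol = torch.linalg.lstsq(X[:, keep], target[:, i : i + 1]).solution
        W_new[i, keep] = sol.squeeze(1)
    return W_new, mask

torch.manual_seed(0)
W, X = torch.randn(8, 16), torch.randn(64, 16)
W_new, mask = prune_and_reconstruct(W, X)
print("mask-only output error:     ", (X @ (W * mask).T - X @ W.T).norm().item())
print("reconstructed output error: ", (X @ W_new.T - X @ W.T).norm().item())

The refit step is why one-shot pruning at this sparsity can keep accuracy: the kept weights absorb part of the error introduced by the pruned ones.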
@billyG881
billyG88
1 year
@neuralmagic YYYEEEEESSSSSSS LETS GOOOOOO
@billyG881
billyG88
1 year
@neuralmagic Another game changer - LLMs will be everywhere, and running on every device in a few years... (they are already running on RasPis) HYYYYPEEEEE
@asda33681687
saurabh verma
1 year
@neuralmagic how many inferences per second do you get with 5 A100 80GB GPUs?
@enjoypolosfu
enjoypolo🟠
1 year
@neuralmagic Can't wait to run this on my phone with data processed on-device.