Georgi Gerganov

@ggerganov

38,771
Followers
244
Following
218
Media
1,256
Statuses

Not AI | 0x0e59 0x2550 24th at the Electrica puzzle challenge

@ggerganov
Georgi Gerganov
1 year
Introducing LLaMA voice chat! 🦙 You can run this locally on an M1 Pro
190
1K
8K
@ggerganov
Georgi Gerganov
8 months
Casually running a 180B parameter LLM on M2 Ultra
83
412
4K
@ggerganov
Georgi Gerganov
1 year
I've started a company: From a fun side project just a few months ago, ggml has now become a useful library and framework for machine learning with a great open-source community
146
395
3K
@ggerganov
Georgi Gerganov
1 year
LLaMA voice chat + Siri TTS. This example is now truly 100% offline, since we are using the built-in Siri text-to-speech available on macOS through the "say" command
45
381
2K
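For reference, a minimal sketch of the "say" half of the demo above: macOS ships a built-in say command, so any locally generated text can be spoken with no network access. The model path and prompt below are illustrative placeholders, not the ones from the demo.

# speak a fixed string with the built-in macOS TTS
$ say "All of this runs on-device."
# pipe llama.cpp output (prompt echo + generated text) straight into say
$ ./main -m ./models/llama-7b-q4_0.gguf -p "Tell me a joke" -n 64 2>/dev/null | say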
@ggerganov
Georgi Gerganov
9 months
Full F16 precision 34B Code Llama at >20 t/s on M2 Ultra
40
270
2K
@ggerganov
Georgi Gerganov
10 months
ggtag: data-over-sound is back! Please check out our latest geeky side project -- an e-paper badge that can be programmed with sound. Here is how it works 🔊
35
258
2K
@ggerganov
Georgi Gerganov
8 months
sam.cpp 👀 Inference of Meta's Segment Anything Model on the CPU. Project by @YavorGI - powered by
35
283
2K
@ggerganov
Georgi Gerganov
10 months
guys it’s real
Tweet media one
46
65
2K
@ggerganov
Georgi Gerganov
2 months
Casually running Grok-1 at home
77
171
2K
@ggerganov
Georgi Gerganov
1 year
The future of on-device inference is ggml + Apple Silicon. You heard it here first!
@natfriedman
Nat Friedman
1 year
Watching llama.cpp do 40 tok/s inference of the 7B model on my M2 Max, with 0% CPU usage, and using all 38 GPU cores. Congratulations @ggerganov ! This is a triumph.
115
761
5K
39
181
2K
@ggerganov
Georgi Gerganov
1 year
Simultaneously running LLaMA-7B (left) + Whisper Small (right) on M1 Pro
30
184
1K
@ggerganov
Georgi Gerganov
9 months
Let’s see what this rock can do
Tweet media one
52
27
1K
@ggerganov
Georgi Gerganov
5 months
Adding support for the new Mixtral models. Runs on CPU, CUDA and Metal with quantization support and partial GPU offloading. Very interesting architecture to play with!
25
153
1K
@ggerganov
Georgi Gerganov
1 year
Announcing the Local LLaMA podcast 🎙️🦙 In today's episode we have LLaMA, GGaMA, SSaMA and RRaMA joining us to discuss the future of AI
32
188
1K
@ggerganov
Georgi Gerganov
6 months
Wrote a short tutorial for setting up llama.cpp on AWS instances. For example, you can use one of the cheapest 16GB VRAM (NVIDIA T4) instances to serve a quantum Mistral 7B model to multiple clients in parallel with full context. Hope it is useful!
Tweet media one
28
174
1K
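The tutorial itself is linked from the tweet; below is only a rough sketch of the kind of invocation it describes, assuming a CUDA build of llama.cpp and a quantized Mistral 7B GGUF. The file name, context size and slot count are illustrative, not the tutorial's exact values.

# build with CUDA, then start the HTTP server with 4 parallel slots and continuous batching
$ make LLAMA_CUBLAS=1
$ ./server -m mistral-7b-instruct-q4_k_m.gguf -ngl 99 -c 16384 -np 4 -cb --host 0.0.0.0 --port 8080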
@ggerganov
Georgi Gerganov
5 months
ggml will soon run on billions of devices @apple don't sleep on it 🙃
@rgerganov
Radoslav Gerganov
5 months
I just verified this on my Pixel 8 Pro phone! It has AICore included and it is using ggml
Tweet media one
Tweet media two
Tweet media three
5
26
267
61
131
1K
@ggerganov
Georgi Gerganov
6 months
Native whisper.cpp server with OAI-like API is now available: $ make server && ./server This is a very convenient way to run an efficient transcription service locally on any kind of hardware (CPU, GPU (CUDA or Metal) or ANE). Thx, felrock!
25
153
1K
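A minimal sketch of using it, assuming a base.en model has already been downloaded. The /inference endpoint and file form field are as the whisper.cpp server example documents them, recalled from memory, so treat the curl line as an assumption.

$ make server
$ ./server -m models/ggml-base.en.bin --host 0.0.0.0 --port 8080
# transcribe a local audio file against the running server
$ curl 127.0.0.1:8080/inference -F file=@samples/jfk.wav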
@ggerganov
Georgi Gerganov
7 months
llama.cpp server now supports multimodal (LLaVA) 🎉 Huge shoutout to FSSRepo and monatis
16
137
1K
@ggerganov
Georgi Gerganov
7 months
👀 What is this black magic!?
22
139
1K
@ggerganov
Georgi Gerganov
1 year
Just added support for all LLaMA models. I'm out of disk space, so if someone can give this a try for 33B and 65B it would be great 😄 See updated instructions in the README. Here is LLaMA-13B at ~10 tokens/s
Tweet media one
@ggerganov
Georgi Gerganov
1 year
I think I can make 4-bit LLaMA-65B inference run on a 64 GB M1 Pro 🤔 Speed should be somewhere around 2 tokens/sec. Is this useful for anything?
37
17
452
26
141
1K
@ggerganov
Georgi Gerganov
1 year
llama.cpp just got access to the new Copilot for Pull Requests technical preview by @github Just add tags like "copilot:all" / "copilot:summary" / "copilot:walkthrough" to your PR comment and the magic happens 🪄
Tweet media one
Tweet media two
15
99
1K
@ggerganov
Georgi Gerganov
1 year
The llama.cpp repo is buzzing with activity today. Here are some highlights: Added Alpaca model support and usage instructions
18
75
954
@ggerganov
Georgi Gerganov
10 months
llama2.c running in a web page. Compiled with Emscripten, with the code modified to predict one token per render pass. The page auto-loads 50MB of model data - sorry about that 😄
@karpathy
Andrej Karpathy
10 months
My fun weekend hack: llama2.c 🦙🤠 Lets you train a baby Llama 2 model in PyTorch, then inference it with one 500-line file with no dependencies, in pure C. My pretrained model (on TinyStories) samples stories in fp32 at 18 tok/s on my MacBook Air M1 CPU.
Tweet media one
93
735
5K
16
151
905
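One plausible way to do the Emscripten build described above, assuming the llama2.c sources and the small TinyStories checkpoint. The exact flags and model used in the demo aren't given in the tweet, so treat this as a sketch.

# compile run.c to wasm + html, bundling the model and tokenizer into the page
$ emcc run.c -O3 -o index.html \
    --preload-file stories15M.bin --preload-file tokenizer.bin \
    -s ALLOW_MEMORY_GROWTH=1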
@ggerganov
Georgi Gerganov
6 months
Here is how to deploy and serve any LLM on HF with a single command in less than 3 minutes with llama.cpp $ bash -c "$(curl -s )"
8
125
876
@ggerganov
Georgi Gerganov
10 months
llama.cpp now supports distributed inference across multiple devices via MPI. This is possible thanks to @EvMill 's work. Looking for people to give this a try and attempt to run a 65B LLaMA on a cluster of Raspberry Pis 🙃
19
142
878
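A sketch of the MPI build and launch, roughly as the llama.cpp README documented it at the time; the hostfile contents, process count and model path are placeholders.

# build with MPI support
$ make CC=mpicc CXX=mpicxx LLAMA_MPI=1
# one process per device listed in the hostfile; each holds a slice of the layers
$ mpirun -hostfile ./hostfile -n 3 ./main -m ./models/65B/ggml-model-q4_0.bin -p "Hello" -n 128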
@ggerganov
Georgi Gerganov
1 year
whisper.cpp v1.3.0, now with Core ML support. Currently, the Encoder runs on the ANE, while the Decoder remains on the CPU. Check the linked PR 566 for implementation details and usage instructions
12
119
773
@ggerganov
Georgi Gerganov
1 year
Here is 4-bit inference of LLaMA-7B using ggml: Pure C/C++, runs on the CPU at 20 tokens/sec (M1 Pro). Generated text looks coherent, but quickly degrades - not sure if I have a bug or something 🤔 Anyway, LLaMA-65B on M1 coming soon!
24
131
751
@ggerganov
Georgi Gerganov
5 months
Running some LLM benches on iPhone 13 Mini. This is 1.1B TinyLlama. Speed looks quite reasonable. Wonder what would be some cool applications that we can try out 🤔 P.S. Forget about useless chat bots - we want something else. Think grammar, function calling, etc.
Tweet media one
Tweet media two
50
70
740
@ggerganov
Georgi Gerganov
2 months
llama.cpp releases now ship with pre-built macOS binaries. This should reduce the entry barrier for llama.cpp on Apple devices. Thanks to @huggingface for the friendly support 🙏
Tweet media one
16
70
737
@ggerganov
Georgi Gerganov
1 year
I'm thinking about making an open-source local iOS voice chat app running Whisper Base + 4-bit Cerebras-GPT 2.7B. Should be able to run close to real-time on newer iPhones. Pretty sure I have everything needed and can build this in a day. Only question is if Cerebras is good enough
43
41
735
@ggerganov
Georgi Gerganov
1 year
Apparently, Stable Diffusion can be used to generate images of spectrograms from text prompts. The spectrograms can in turn be converted to audio using STFT and some tricks. Mind is blown!
Tweet media one
20
124
681
@ggerganov
Georgi Gerganov
9 months
Experimenting with speculative decoding + grammar sampling. This is an example of summarizing a short story into structured JSON. We again utilize speculative decoding, but this time we constrain the output using a JSON grammar to achieve a >95% token acceptance rate
11
71
682
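The grammar-constrained half of this can be reproduced with the JSON grammar that ships in the llama.cpp repo; the sketch below skips the draft model and uses placeholder model and prompt values.

# constrain generation to valid JSON via grammars/json.gbnf
$ ./main -m ./models/llama-2-13b-q4_k_m.gguf \
    --grammar-file grammars/json.gbnf \
    -p "Summarize the following story as JSON: ..." -n 256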
@ggerganov
Georgi Gerganov
7 months
M2 Ultra serving Q8_0 LLaMA-v2 70B to 4 clients in parallel
16
70
667
@ggerganov
Georgi Gerganov
11 months
shower thought: drop the position embeddings, rewrite the transformer using complex numbers, and encode the position information in the complex phase. ref: see how MRI phase encoding works
32
26
645
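For what it's worth, this is essentially what rotary position embeddings (RoPE) already do: each pair of feature dimensions is treated as a complex number and rotated by an angle proportional to the position, so attention scores depend only on the relative offset:

f(\mathbf{x}, m) = \mathbf{x}\, e^{i m \theta}, \qquad f(\mathbf{q}, m)\,\overline{f(\mathbf{k}, n)} = \mathbf{q}\,\overline{\mathbf{k}}\, e^{i (m - n)\theta}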
@ggerganov
Georgi Gerganov
1 year
Top quality post on r/LocalLLaMA today 😅 Btw, great subreddit!
Tweet media one
9
55
647
@ggerganov
Georgi Gerganov
7 months
Serving 8 clients in parallel on A100 with llama.cpp
Model: Codellama 7B F16
System prompt: 305 tokens
Requests: 128
Max sequence length: 100
Continuous batching: enabled
Average speed: ~484 t/s (including prompts and generated tokens)
20
64
606
@ggerganov
Georgi Gerganov
7 months
llama.cpp is standing its ground against the behemoths. The CUDA backend is contained in a single C++ file, so it allows for very easy deployment and custom modifications (pp - prefill, tg - text gen)
Tweet media one
@abacaj
anton
7 months
Trying out the new TensorRT-LLM framework and get some pretty good performance out of the box with 3090s. 107 tokens/sec int8 and 54 tok/sec bf16 for llama-2 7B models (not much work to setup either) Get 160+ tokens/sec on 2x3090s (these are just batch_size=1)
Tweet media one
17
29
270
12
47
584
@ggerganov
Georgi Gerganov
1 year
2-, 3-, 4-, 5- and 6-bit quantization methods are now available in llama.cpp. Efficient inference implementation with ARM NEON, AVX2 and CUDA - see sample numbers in the screenshots. Big thanks to ikawrakow for this contribution. More info:
Tweet media one
Tweet media two
13
74
583
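Converting an existing F16 model to one of these formats is a single invocation of the bundled quantize tool; a sketch with placeholder paths, assuming an F16 GGML file produced by the convert script.

$ ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_K_M.bin Q4_K_M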
@ggerganov
Georgi Gerganov
8 months
Full GPU Metal inference with whisper.cpp. This is the Medium model on M2 Ultra, greedy decoding
15
54
577
@ggerganov
Georgi Gerganov
1 month
Challenge accepted! 😀
@awnihannun
Awni Hannun
1 month
Achievement unlocked: 100 tokens-per-sec, 4-bit Mistral 7B in MLX on an M2 Ultra
Tweet media one
15
43
478
11
36
573
@ggerganov
Georgi Gerganov
2 months
The GGUF file format is a great example of the cool things that an open-source community can achieve. Props to @philpax_ and everyone else involved in the design and implementation of the format. I'm thankful and happy to see that it finds adoption in ML
@mishig25
Mishig Davaadorj
2 months
At @huggingface , we are adding more support to GGUF (model format by @ggerganov ). The number of GGUF models on the hub has been exploding & doesn't look like it is gonna slow down🔥 see more at:
Tweet media one
10
37
259
11
66
516
@ggerganov
Georgi Gerganov
1 year
Initial low-rank adaptation support has been added to llama.cpp. We now have the option to apply LoRA adapters to a base model at runtime. Lots of room for improvement, and this opens up possibilities for some interesting applications
9
82
547
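Applying an adapter at runtime is a single extra flag; a minimal sketch with placeholder paths (at the time, --lora-base could additionally point at an F16 base when the main model is quantized).

# load a base model and apply a LoRA adapter on top of it at load time
$ ./main -m ./models/7B/ggml-model-f16.bin \
    --lora ./lora/ggml-adapter-model.bin \
    -p "..." -n 128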
@ggerganov
Georgi Gerganov
9 months
Here are some inference numbers for Code Llama on M2 Ultra at different quantum levels using the latest llama.cpp (pp - prompt processing, tg - text generation). Code Llama 7B
Tweet media one
12
64
549
@ggerganov
Georgi Gerganov
9 months
The ggml roadmap is progressing as expected, with a lot of infrastructural development already completed. We now enter the more interesting phase of the project - applying the framework to practical problems and doing cool stuff on the Edge
Tweet media one
@ggerganov
Georgi Gerganov
11 months
Took the time to prepare a ggml development roadmap in the form of a GitHub Project. This sets the priorities for the short/mid term and will offer a good way for everyone to keep track of the progress that is being made across related projects
Tweet media one
10
41
509
7
41
540
@ggerganov
Georgi Gerganov
1 year
Can't help but feel the AI hype is oriented in a non-optimal direction. It's almost as if we had just discovered the FFT algorithm and, instead of revolutionizing telecommunications, we are using it to build Tamagotchis. P.S. I'm only half joking 😄
32
35
522
@ggerganov
Georgi Gerganov
11 months
Took the time to prepare a ggml development roadmap in the form of a GitHub Project. This sets the priorities for the short/mid term and will offer a good way for everyone to keep track of the progress that is being made across related projects
Tweet media one
10
41
509
@ggerganov
Georgi Gerganov
1 year
I'm trying to figure out what this means. Any ideas?
Tweet media one
40
32
505
@ggerganov
Georgi Gerganov
1 year
Progress update on adding Core ML support to whisper.cpp. We can now run the small model with a 400ms time step quite efficiently, thanks to evaluating the Encoder on the ANE
11
44
496
@ggerganov
Georgi Gerganov
7 months
Some of llama.cpp's features
Tweet media one
Tweet media two
Tweet media three
Tweet media four
13
37
486
@ggerganov
Georgi Gerganov
1 year
Interactive chat mode added to 🦙.cpp. It actually works surprisingly well from the few tests that I tried! Kindly contributed by GH user Blackhole89
Tweet media one
12
44
482
@ggerganov
Georgi Gerganov
8 months
Initial tests with parallel decoding in llama.cpp. A simulated server processing 64 client requests with 32 decoding streams on M2 Ultra. Supports hot-plugging of new sequences. Model is 30B LLaMA F16. ~4000 tokens (994 prompt + 3001 gen) with a system prompt of 305 tokens in 46s
17
53
477
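This kind of simulated-server test is presumably what the parallel example in the llama.cpp repo is for; the flag names below are recalled from that example's usage and should be treated as assumptions rather than exact syntax.

# N decoding streams (-np) serving M simulated client requests (-ns), continuous batching on
$ ./parallel -m ./models/30B/ggml-model-f16.gguf -np 32 -ns 64 -n 100 -cb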
@ggerganov
Georgi Gerganov
1 year
Will be cancelling my Github Copilot subscription soon 🙃
@jamesravey
James Ravenscroft
1 year
Introducing TurboPilot - my #copilot clone that runs a 6 billion parameter code completion #LLM on your laptop in 4GB of RAM - no GPU required. You can integrate it directly into Visual Studio and be up and running in a few minutes.
26
333
2K
9
33
465
@ggerganov
Georgi Gerganov
1 year
Here is what a properly built llama.cpp looks like: running 7B on a 2-year-old Pixel 5 at 1 token/sec. Would be interesting to see what an interactive session feels like
@rgerganov
Radoslav Gerganov
1 year
Running llama.cpp on my Pixel5 phone with termux. Kudos to @ggerganov !
Tweet media one
12
34
295
11
69
460
@ggerganov
Georgi Gerganov
1 year
I think I can make 4-bit LLaMA-65B inference run on a 64 GB M1 Pro 🤔 Speed should be somewhere around 2 tokens/sec. Is this useful for anything?
37
17
452
@ggerganov
Georgi Gerganov
1 month
GGUF My Repo by @huggingface Create quantum GGUF models fully online - quickly and securely. Thanks to @reach_vb , @pcuenq and team for creating this HF space! In the video below I give it a try to create a quantum 8-bit model of Gemma 2B - it took about
24
90
462
@ggerganov
Georgi Gerganov
9 months
ROCm support in llama.cpp. A 4-month community effort enables AMD devices to run quantum LLMs with high efficiency. Really great to see the strong collaboration in this work!
11
66
452
@ggerganov
Georgi Gerganov
6 months
Very clever stuff! Will be adding a llama.cpp example soon
@lmsysorg
lmsys.org
6 months
Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step. Blog: Code:
23
250
1K
7
39
454
@ggerganov
Georgi Gerganov
1 year
I'm color-coding Whisper tokens based on their probs -- green means confident. All models behave in a similar way (first 3 images), except for Large V2. The probs are all over the place (4th image) 🤔 Do I have a bug or is this model somehow unstable?
Tweet media one
Tweet media two
Tweet media three
Tweet media four
15
30
449
@ggerganov
Georgi Gerganov
1 year
The plan for adding full-fledged GPU support in ggml is starting to take shape. Today I finally finished the ggml computation graph export / import functionality and demonstrated basic MNIST inference on the Apple Silicon GPU using Metal
8
65
443
@ggerganov
Georgi Gerganov
1 year
4-bit integer quantisation in whisper.cpp / ggml. You can now run the Large Whisper model locally in a web page via WebAssembly SIMD
Tweet media one
11
66
441
@ggerganov
Georgi Gerganov
10 months
Very cool experiment by @chillgates_ : distributed MPI inference using llama.cpp with 6 Raspberry Pis - each one, with 8GB RAM, "sees" 1/6 of the entire 65B model. Inference starts around ~1:10. Follow the progress here:
@chillgates_
Loki (cute/acc)
10 months
Yeah. I have ChatGPT at home. Not a silly 7b model. A full-on 65B model that runs on my pi cluster, watch how the model gets loaded across the cluster with mmap and does round-robin inferencing 🫡 (10 seconds/token) (sped up 16x)
85
188
2K
12
75
437
@ggerganov
Georgi Gerganov
1 year
This is the prompt for anyone interested:
8
51
432
@ggerganov
Georgi Gerganov
10 months
napkin math ahead:
- buy 8 mac mini (200GB/s, ~$1.2k each)
- run LLAMA_METAL=1 LLAMA_MPI=1 for interleaved pipeline inference
- deploy on-premise, serve up to 8 clients in parallel at 25 t/s / 4-bit / 7B
is this cost efficient? energy wise? thanks to @stanimirovb for the idea
26
26
418
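Spelling out the same napkin math (rounded, and assuming generation is purely memory-bandwidth bound, which ignores compute and pipeline overhead):

8 \times \$1.2\text{k} \approx \$9.6\text{k total}, \qquad \frac{200\ \text{GB/s}}{\approx 4\ \text{GB (4-bit 7B)}} \approx 50\ \text{t/s ceiling per device}

So 25 t/s per client is roughly half the bandwidth-bound ceiling of a single mini - the numbers are at least in the right ballpark.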
@ggerganov
Georgi Gerganov
1 year
The new image segmentation model SAM by Meta looks extremely interesting
16
14
415
@ggerganov
Georgi Gerganov
7 months
This is LLaVA 7B v1.5 running on M2 Ultra, thanks to the amazing work of GH user monatis. I'm surprised this works so well - I downloaded a few photos from my phone and every single one was accurately described. Mind is blown!
4
38
416
@ggerganov
Georgi Gerganov
3 months
"inference on your head"
@josephsemrai
Joseph Semrai
3 months
inference on your head mistral 7b (4bit quantized) running locally on apple vision pro
38
118
1K
6
32
413
@ggerganov
Georgi Gerganov
9 months
"Wait, Georgi, how is this even possible?" you might ask. After all, the M2 Ultra only has 800GB/s bandwidth. Other people normally need 4 high-end GPUs to do this The answer is: Speculative Sampling
9
38
404
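llama.cpp ships a speculative example for exactly this: a small draft model proposes a batch of tokens and the large target model verifies them in a single pass, so the big model's weights are read far less often per generated token. A sketch with placeholder models and a guessed draft length; flag names are recalled from the example, so treat them as assumptions.

$ ./speculative -m codellama-34b-f16.gguf -md codellama-7b-q4_0.gguf \
    -p "..." -n 256 --draft 16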
@ggerganov
Georgi Gerganov
1 year
RWKV port in ggml by the community: I haven't had the chance to look at this in detail yet, but it feels great that people are picking up ggml and applying it to more and more models
3
54
388
@ggerganov
Georgi Gerganov
1 year
Here I outline a potential strategy for adding GPU support to ggml. Not sure how feasible it is yet, but it could be a fun exercise for people with GPU programming experience
6
49
385
@ggerganov
Georgi Gerganov
1 year
Powered by: ggml / whisper.cpp / llama.cpp / Core ML
STT: Whisper Small
LLM: 13B LLaMA
TTS: @elevenlabsio
The Whisper Encoder is running on the Apple Neural Engine. Everything else is optimized via ARM NEON and Apple Accelerate
10
18
370
@ggerganov
Georgi Gerganov
5 months
Playing some chess using voice. WASM whisper.cpp with a quantized tiny model + grammar sampling (by @ejones ). Runs locally in the browser. Not perfect, but I think pretty good overall! Try it here:
8
44
360
@ggerganov
Georgi Gerganov
9 months
70B home assistant running on M2 Ultra at 15 t/s. I can now cancel my ChatGPT API subscription
Tweet media one
Tweet media two
20
24
356
@ggerganov
Georgi Gerganov
1 year
To run the released model with the latest llama.cpp, use the "convert-unversioned-ggml-to-ggml" python script and apply the following patch to llama.cpp. The latest llama.cpp offers significant performance and accuracy improvements in the inference computation
Tweet media one
@andriy_mulyar
AndriyMulyar
1 year
I'm excited to announce the release of GPT4All, a 7B param language model finetuned from a curated set of 400k GPT-Turbo-3.5 assistant-style generation. We release💰800k data samples💰 for anyone to build upon and a model you can run on your laptop! Real-time Sampling on M1 Mac
161
986
7K
6
47
353
@ggerganov
Georgi Gerganov
1 year
Added a ggml example for using Cerebras-GPT. I think the sampling needs some work because I can't get it to generate coherent stuff yet. Using the quantized 6.7B model. Here are the code and usage instructions if you want to play with it:
Tweet media one
7
40
349
@ggerganov
Georgi Gerganov
1 year
So, someone just DM'd me on Twitter a patch that improves the inference time by 10% on ARM NEON (i.e. Apple Silicon). Probably more people should go there and optimise this stuff
5
26
342
@ggerganov
Georgi Gerganov
11 months
600 posts per day actually sounds great to me. Some days I do feel I waste too much time here
12
15
339
@ggerganov
Georgi Gerganov
5 months
Some performance stats for llama.cpp on A-series chips (iPhone / iPad). We are collecting benchmarks for 1B, 3B and 7B models at different quantization levels. Can be used as a reference for the expected LLM performance on these devices
5
46
332
@ggerganov
Georgi Gerganov
9 months
Code Llama 34B using Q4_K_M quantization on a MacBook
@rjghik
Roman Janusz
9 months
@abacaj Quantized version running on MBP M2 Max (llama.cpp)
5
6
74
7
34
330
@ggerganov
Georgi Gerganov
1 year
Some good old airplane-mode programming. No copilot, no voice control, no AR/VR, no AI augmentations, no cybernetic implants. Just VIM and the sunrise
Tweet media one
10
8
315
@ggerganov
Georgi Gerganov
1 year
Great write-up. The CoreML branch speeds up just the Encoder. At the same time, the master branch already has an additional ~2-3x speed-up in the Decoder thanks to recent work on llama.cpp. When we merge these 2 together, the performance will be mind-blowing
@bjnortier
Ben Nortier
1 year
Hello Transcribe 2.2 with CoreML is out, now 3x-7x faster 🚀🥳 Blog post: App Store: #OpenAI #AI #Whisper #CoreML
Tweet media one
4
14
134
9
29
320
@ggerganov
Georgi Gerganov
6 months
What is the intuition for having all the LLaMA layers be the same size instead of, let's say, increasing in size: start with a small hidden state and keep increasing it as you go through the layers? At layer 1, we have just a single token with no context - a 4096 hidden state seems too big
28
21
310
@ggerganov
Georgi Gerganov
1 year
And one more example using the macOS "say" command
10
21
313
@ggerganov
Georgi Gerganov
11 months
First results on M2 Ultra using llama.cpp
Tweet media one
7
14
304
@ggerganov
Georgi Gerganov
1 year
I'm incredibly grateful to @natfriedman and @danielgross for the support & funding, and also for helping me get even more inspired in this project. There is still a long way ahead with many ideas to try and cool things to do. Hope you will join and help us create something useful!
13
5
305
@ggerganov
Georgi Gerganov
9 months
Similar test with M2 Ultra using vanilla 70B LLaMA. Prompt processing is currently unoptimized, but just the text generation yields ~13 tok/s. This is Q4_0 quantization (~39GB) using Metal
Tweet media one
@Teknium1
Teknium (e/λ)
9 months
Llama-70B (StableBeluga2-70B) inference at home with exllama:
Tweet media one
8
10
137
13
28
297
@ggerganov
Georgi Gerganov
6 months
We've gathered some performance stats of llama.cpp on Apple Silicon. Can be useful to compare the speed across the M-series chips
Tweet media one
9
24
293
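Tables like this are typically produced with the bundled llama-bench tool, which reports prompt processing (pp) and text generation (tg) rates; a minimal sketch with a placeholder model path.

# run the default pp/tg benchmark for a given model
$ ./llama-bench -m ./models/llama-7b-q4_0.gguf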
@ggerganov
Georgi Gerganov
6 months
Don't stop there - add Whisper to talk to it
@Karmedge
Robert Lukoszko — e/acc
6 months
So I set up BakLLaVA-1 in the llama.cpp, and now it can provide real-time descriptions of the live feed from my camera check it out! open source? cc: @nisten @thursdai_pod @willdepue #llama
62
127
866
7
23
292
@ggerganov
Georgi Gerganov
2 months
Here is a thread with a (small) part of the exciting developments in llama.cpp over the past month
Tweet media one
Tweet media two
Tweet media three
6
33
287
@ggerganov
Georgi Gerganov
1 year
I will soon be looking into hiring people to work full-time on ggml, and I'm also interested in hearing from companies that want to use ggml commercially and need technical support. Contact jobs@ggml.ai and sales@ggml.ai
6
11
286
@ggerganov
Georgi Gerganov
5 months
Here is a quick demo with the 4-bit base model on M2 Ultra. You can run this on your MacBooks - the quantized models are between 15GB ~ 50GB. The prompt processing speed is subpar atm, but we will improve this
15
26
279
@ggerganov
Georgi Gerganov
6 months
Love this demo!
- Using the batched decoding API, the continuations can be computed faster
- Probably a longer context would be helpful. Could be computed in the background while typing
@dylfreed
Dylan Freedman
6 months
Prototyping a real-time AI writing tool to show how large language models are essentially probability engines. (thanks to @ggerganov 's llama.cpp for enabling this to run rapidly on an 8GB RAM MacBook Air)
16
39
399
2
34
277
@ggerganov
Georgi Gerganov
1 year
Basic MNIST inference example with ggml. Thanks to @cromwellian for the contribution
Tweet media one
Tweet media two
5
29
271