Georgi Gerganov

@ggerganov

38,771
Followers
244
Following
218
Media
1,256
Statuses

Not AI | 0x0e59 0x2550 24th at the Electrica puzzle challenge

@ggerganov
Georgi Gerganov
1 year
Introducing LLaMA voice chat! 🦙 You can run this locally on an M1 Pro
190
1K
8K
@ggerganov
Georgi Gerganov
8 months
Casually running a 180B parameter LLM on M2 Ultra
83
412
4K
@ggerganov
Georgi Gerganov
1 year
I've started a company: From a fun side project just a few months ago, ggml has now become a useful library and framework for machine learning with a great open-source community
146
395
3K
@ggerganov
Georgi Gerganov
1 year
LLaMA voice chat + Siri TTS. This example is now truly 100% offline, since we are using the built-in Siri text-to-speech available on macOS through the "say" command
45
381
2K
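For reference, a minimal sketch of the "say" half of the demo above: macOS ships a built-in say command, so any locally generated text can be spoken with no network access. The model path and prompt below are illustrative placeholders, not the ones from the demo.

# speak a fixed string with the built-in macOS TTS
$ say "All of this runs on-device."
# pipe llama.cpp output (prompt echo + generated text) straight into say
$ ./main -m ./models/llama-7b-q4_0.gguf -p "Tell me a joke" -n 64 2>/dev/null | say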
@ggerganov
Georgi Gerganov
9 months
Full F16 precision 34B Code Llama at >20 t/s on M2 Ultra
40
270
2K
@ggerganov
Georgi Gerganov
10 months
ggtag: data-over-sound is back! Please check out our latest geeky side project -- an e-paper badge that can be programmed with sound. Here is how it works 🔊
35
258
2K
@ggerganov
Georgi Gerganov
8 months
sam.cpp 👀 Inference of Meta's Segment Anything Model on the CPU. Project by @YavorGI - powered by
35
283
2K
@ggerganov
Georgi Gerganov
10 months
guys it’s real
Tweet media one
46
65
2K
@ggerganov
Georgi Gerganov
2 months
Casually running Grok-1 at home
77
171
2K
@ggerganov
Georgi Gerganov
1 year
The future of on-device inference is ggml + Apple Silicon. You heard it here first!
@natfriedman
Nat Friedman
1 year
Watching llama.cpp do 40 tok/s inference of the 7B model on my M2 Max, with 0% CPU usage, and using all 38 GPU cores. Congratulations @ggerganov ! This is a triumph.
115
761
5K
39
181
2K
@ggerganov
Georgi Gerganov
1 year
Simultaneously running LLaMA-7B (left) + Whisper Small (right) on M1 Pro
30
184
1K
@ggerganov
Georgi Gerganov
9 months
Let’s see what this rock can do
Tweet media one
52
27
1K
@ggerganov
Georgi Gerganov
5 months
Adding support for the new Mixtral models. Runs on CPU, CUDA and Metal with quantization support and partial GPU offloading. Very interesting architecture to play with!
25
153
1K
@ggerganov
Georgi Gerganov
1 year
Announcing the Local LLaMA podcast 🎙️🦙 In today's episode we have LLaMA, GGaMA, SSaMA and RRaMA joining us to discuss the future of AI
32
188
1K
@ggerganov
Georgi Gerganov
6 months
Wrote a short tutorial for setting up llama.cpp on AWS instances. For example, you can use one of the cheapest 16GB VRAM (NVIDIA T4) instances to serve a quantum Mistral 7B model to multiple clients in parallel with full context. Hope it is useful!
Tweet media one
28
174
1K
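The tutorial itself is linked from the tweet; below is only a rough sketch of the kind of invocation it describes, assuming a CUDA build of llama.cpp and a quantized Mistral 7B GGUF. The file name, context size and slot count are illustrative, not the tutorial's exact values.

# build with CUDA, then start the HTTP server with 4 parallel slots and continuous batching
$ make LLAMA_CUBLAS=1
$ ./server -m mistral-7b-instruct-q4_k_m.gguf -ngl 99 -c 16384 -np 4 -cb --host 0.0.0.0 --port 8080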
@ggerganov
Georgi Gerganov
5 months
ggml will soon run on billions of devices @apple don't sleep on it 🙃
@rgerganov
Radoslav Gerganov
5 months
I just verified this on my Pixel 8 Pro phone! It has AICore included and it is using ggml
Tweet media one
Tweet media two
Tweet media three
5
26
267
61
131
1K
@ggerganov
Georgi Gerganov
6 months
Native whisper.cpp server with OAI-like API is now available: $ make server && ./server This is a very convenient way to run an efficient transcription service locally on any kind of hardware (CPU, GPU (CUDA or Metal) or ANE). Thx, felrock!
25
153
1K
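A minimal sketch of using it, assuming a base.en model has already been downloaded. The /inference endpoint and file form field are as the whisper.cpp server example documents them, recalled from memory, so treat the curl line as an assumption.

$ make server
$ ./server -m models/ggml-base.en.bin --host 0.0.0.0 --port 8080
# transcribe a local audio file against the running server
$ curl 127.0.0.1:8080/inference -F file=@samples/jfk.wav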
@ggerganov
Georgi Gerganov
7 months
llama.cpp server now supports multimodal (LLaVA) 🎉 Huge shoutout to FSSRepo and monatis
16
137
1K
@ggerganov
Georgi Gerganov
7 months
👀 What is this black magic!?
22
139
1K
@ggerganov
Georgi Gerganov
1 year
Just added support for all LLaMA models. I'm out of disk space, so if someone can give this a try for 33B and 65B it would be great 😄 See updated instructions in the README. Here is LLaMA-13B at ~10 tokens/s
Tweet media one
@ggerganov
Georgi Gerganov
1 year
I think I can make 4-bit LLaMA-65B inference run on a 64 GB M1 Pro 🤔 Speed should be somewhere around 2 tokens/sec. Is this useful for anything?
37
17
452
26
141
1K
@ggerganov
Georgi Gerganov
1 year
llama.cpp just got access to the new Copilot for Pull Requests technical preview by @github Just add tags like "copilot:all" / "copilot:summary" / "copilot:walkthrough" to your PR comment and the magic happens 🪄
Tweet media one
Tweet media two
15
99
1K
@ggerganov
Georgi Gerganov
1 year
The llama.cpp repo is buzzing with activity today. Here are some highlights: Added Alpaca model support and usage instructions
18
75
954
@ggerganov
Georgi Gerganov
10 months
llama2.c running in a web page. Compiled with Emscripten, with the code modified to predict one token per render pass. The page auto-loads 50MB of model data - sorry about that 😄
@karpathy
Andrej Karpathy
10 months
My fun weekend hack: llama2.c 🦙🤠 Lets you train a baby Llama 2 model in PyTorch, then inference it with one 500-line file with no dependencies, in pure C. My pretrained model (on TinyStories) samples stories in fp32 at 18 tok/s on my MacBook Air M1 CPU.
Tweet media one
93
735
5K
16
151
905
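One plausible way to do the Emscripten build described above, assuming the llama2.c sources and the small TinyStories checkpoint. The exact flags and model used in the demo aren't given in the tweet, so treat this as a sketch.

# compile run.c to wasm + html, bundling the model and tokenizer into the page
$ emcc run.c -O3 -o index.html \
    --preload-file stories15M.bin --preload-file tokenizer.bin \
    -s ALLOW_MEMORY_GROWTH=1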
@ggerganov
Georgi Gerganov
6 months
Here is how to deploy and serve any LLM on HF with a single command in less than 3 minutes with llama.cpp $ bash -c "$(curl -s )"
8
125
876
@ggerganov
Georgi Gerganov
10 months
llama.cpp now supports distributed inference across multiple devices via MPI. This is possible thanks to @EvMill 's work. Looking for people to give this a try and attempt to run a 65B LLaMA on a cluster of Raspberry Pis 🙃
19
142
878
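A sketch of the MPI build and launch, roughly as the llama.cpp README documented it at the time; the hostfile contents, process count and model path are placeholders.

# build with MPI support
$ make CC=mpicc CXX=mpicxx LLAMA_MPI=1
# one process per device listed in the hostfile; each holds a slice of the layers
$ mpirun -hostfile ./hostfile -n 3 ./main -m ./models/65B/ggml-model-q4_0.bin -p "Hello" -n 128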
@ggerganov
Georgi Gerganov
1 year
whisper.cpp v1.3.0, now with Core ML support. Currently, the Encoder runs on the ANE, while the Decoder remains on the CPU. Check the linked PR 566 for implementation details and usage instructions
12
119
773
@ggerganov
Georgi Gerganov
1 year
Here is 4-bit inference of LLaMA-7B using ggml: Pure C/C++, runs on the CPU at 20 tokens/sec (M1 Pro). Generated text looks coherent, but quickly degrades - not sure if I have a bug or something 🤔 Anyway, LLaMA-65B on M1 coming soon!
24
131
751
@ggerganov
Georgi Gerganov
5 months
Running some LLM benches on iPhone 13 Mini. This is 1.1B TinyLlama. Speed looks quite reasonable. Wonder what would be some cool applications that we can try out 🤔 P.S. Forget about useless chat bots - we want something else. Think grammar, function calling, etc.
Tweet media one
Tweet media two
50
70
740
@ggerganov
Georgi Gerganov
2 months
llama.cpp releases now ship with pre-built macOS binaries. This should reduce the entry barrier for llama.cpp on Apple devices. Thanks to @huggingface for the friendly support 🙏
Tweet media one
16
70
737
@ggerganov
Georgi Gerganov
1 year
I'm thinking about making an open-source local iOS voice chat app running Whisper Base + 4-bit Cerebras-GPT 2.7B. Should be able to run close to real-time on newer iPhones. Pretty sure I have everything needed and can build this in a day. Only question is if Cerebras is good enough
43
41
735
@ggerganov
Georgi Gerganov
1 year
Apparently, Stable Diffusion can be used to generate images of spectrograms from text prompts. The spectrograms can in turn be converted to audio using STFT and some tricks. Mind is blown!
Tweet media one
20
124
681
@ggerganov
Georgi Gerganov
9 months
Experimenting with speculative decoding + grammar sampling. This is an example of summarizing a short story into structured JSON. We again utilize speculative decoding, but this time we constrain the output using a JSON grammar to achieve a >95% token acceptance rate
11
71
682
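The grammar-constrained half of this can be reproduced with the JSON grammar that ships in the llama.cpp repo; the sketch below skips the draft model and uses placeholder model and prompt values.

# constrain generation to valid JSON via grammars/json.gbnf
$ ./main -m ./models/llama-2-13b-q4_k_m.gguf \
    --grammar-file grammars/json.gbnf \
    -p "Summarize the following story as JSON: ..." -n 256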
@ggerganov
Georgi Gerganov
7 months
M2 Ultra serving Q8_0 LLaMA-v2 70B to 4 clients in parallel
16
70
667
@ggerganov
Georgi Gerganov
11 months
shower thought: drop the position embeddings, rewrite the transformer using complex numbers, and encode the position information in the complex phase. ref: see how MRI phase encoding works
32
26
645
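For what it's worth, this is essentially what rotary position embeddings (RoPE) already do: each pair of feature dimensions is treated as a complex number and rotated by an angle proportional to the position, so attention scores depend only on the relative offset:

f(\mathbf{x}, m) = \mathbf{x}\, e^{i m \theta}, \qquad f(\mathbf{q}, m)\,\overline{f(\mathbf{k}, n)} = \mathbf{q}\,\overline{\mathbf{k}}\, e^{i (m - n)\theta}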
@ggerganov
Georgi Gerganov
1 year
Top quality post on r/LocalLLaMA today 😅 Btw, great subreddit!
Tweet media one
9
55
647
@ggerganov
Georgi Gerganov
7 months
Serving 8 clients in parallel on A100 with llama.cpp
Model: Codellama 7B F16
System prompt: 305 tokens
Requests: 128
Max sequence length: 100
Continuous batching: enabled
Average speed: ~484 t/s (including prompts and generated tokens)
20
64
606
@ggerganov
Georgi Gerganov
7 months
llama.cpp is standing its ground against the behemoths. The CUDA backend is contained in a single C++ file, so it allows for very easy deployment and custom modifications (pp - prefill, tg - text gen)
Tweet media one
@abacaj
anton
7 months
Trying out the new TensorRT-LLM framework and get some pretty good performance out of the box with 3090s. 107 tokens/sec int8 and 54 tok/sec bf16 for llama-2 7B models (not much work to setup either) Get 160+ tokens/sec on 2x3090s (these are just batch_size=1)
Tweet media one
17
29
270
12
47
584
@ggerganov
Georgi Gerganov
1 year
2-, 3-, 4-, 5- and 6-bit quantization methods are now available in llama.cpp. Efficient inference implementation with ARM NEON, AVX2 and CUDA - see sample numbers in the screenshots. Big thanks to ikawrakow for this contribution. More info:
Tweet media one
Tweet media two
13
74
583
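Converting an existing F16 model to one of these formats is a single invocation of the bundled quantize tool; a sketch with placeholder paths, assuming an F16 GGML file produced by the convert script.

$ ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_K_M.bin Q4_K_M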
@ggerganov
Georgi Gerganov
8 months
Full GPU Metal inference with whisper.cpp. This is the Medium model on M2 Ultra, greedy decoding
15
54
577
@ggerganov
Georgi Gerganov
1 month
Challenge accepted! 😀
@awnihannun
Awni Hannun
1 month
Achievement unlocked: 100 tokens-per-sec, 4-bit Mistral 7B in MLX on an M2 Ultra
Tweet media one
15
43
478
11
36
573
@ggerganov
Georgi Gerganov
2 months
The GGUF file format is a great example of the cool things that an open-source community can achieve. Props to @philpax_ and everyone else involved in the design and implementation of the format. I'm thankful and happy to see that it finds adoption in ML
@mishig25
Mishig Davaadorj
2 months
At @huggingface , we are adding more support to GGUF (model format by @ggerganov ). The number of GGUF models on the hub has been exploding & doesn't look like it is gonna slow down🔥 see more at:
Tweet media one
10
37
259
11
66
516
@ggerganov
Georgi Gerganov
1 year
Initial low-rank adaptation support has been added to llama.cpp. We now have the option to apply LoRA adapters to a base model at runtime. Lots of room for improvement, and this opens up possibilities for some interesting applications
9
82
547
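Applying an adapter at runtime is a single extra flag; a minimal sketch with placeholder paths (at the time, --lora-base could additionally point at an F16 base when the main model is quantized).

# load a base model and apply a LoRA adapter on top of it at load time
$ ./main -m ./models/7B/ggml-model-f16.bin \
    --lora ./lora/ggml-adapter-model.bin \
    -p "..." -n 128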
@ggerganov
Georgi Gerganov
9 months
Here are some inference numbers for Code Llama on M2 Ultra at different quantum levels using the latest llama.cpp (pp - prompt processing, tg - text generation). Code Llama 7B
Tweet media one
12
64
549
@ggerganov
Georgi Gerganov
9 months
The ggml roadmap is progressing as expected, with a lot of infrastructural development already completed. We now enter the more interesting phase of the project - applying the framework to practical problems and doing cool stuff on the Edge
Tweet media one
@ggerganov
Georgi Gerganov
11 months
Took the time to prepare a ggml development roadmap in the form of a GitHub Project. This sets the priorities for the short/mid term and will offer a good way for everyone to keep track of the progress that is being made across related projects
Tweet media one
10
41
509
7
41
540
@ggerganov
Georgi Gerganov
1 year
Can't help but feel the AI hype is oriented in a non-optimal direction. It's almost as if we had just discovered the FFT algorithm and, instead of revolutionizing telecommunications, we are using it to build Tamagotchis. P.S. I'm only half joking 😄
32
35
522
@ggerganov
Georgi Gerganov
11 months
Took the time to prepare a ggml development roadmap in the form of a GitHub Project. This sets the priorities for the short/mid term and will offer a good way for everyone to keep track of the progress that is being made across related projects
Tweet media one
10
41
509
@ggerganov
Georgi Gerganov
1 year
I'm trying to figure out what this means. Any ideas?
Tweet media one
40
32
505
@ggerganov
Georgi Gerganov
1 year
Progress update on adding Core ML support to whisper.cpp. We can now run the small model with a 400ms time step quite efficiently, thanks to evaluating the Encoder on the ANE
11
44
496
@ggerganov
Georgi Gerganov
7 months
Some of llama.cpp's features
Tweet media one
Tweet media two
Tweet media three
Tweet media four
13
37
486
@ggerganov
Georgi Gerganov
1 year
Interactive chat mode added to 🦙.cpp. It actually works surprisingly well from the few tests that I tried! Kindly contributed by GH user Blackhole89
Tweet media one
12
44
482
@ggerganov
Georgi Gerganov
8 months
Initial tests with parallel decoding in llama.cpp. A simulated server processing 64 client requests with 32 decoding streams on M2 Ultra. Supports hot-plugging of new sequences. Model is 30B LLaMA F16. ~4000 tokens (994 prompt + 3001 gen) with a system prompt of 305 tokens in 46s
17
53
477
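This kind of simulated-server test is presumably what the parallel example in the llama.cpp repo is for; the flag names below are recalled from that example's usage and should be treated as assumptions rather than exact syntax.

# N decoding streams (-np) serving M simulated client requests (-ns), continuous batching on
$ ./parallel -m ./models/30B/ggml-model-f16.gguf -np 32 -ns 64 -n 100 -cb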
@ggerganov
Georgi Gerganov
1 year
Will be cancelling my Github Copilot subscription soon 🙃
@jamesravey
James Ravenscroft
1 year
Introducing TurboPilot - my #copilot clone that runs a 6 billion parameter code completion #LLM on your laptop in 4GB of RAM - no GPU required. You can integrate it directly into Visual Studio and be up and running in a few minutes.
26
333
2K
9
33
465
@ggerganov
Georgi Gerganov
1 year
Here is what a properly built llama.cpp looks like: running 7B on a 2-year-old Pixel 5 at 1 token/sec. Would be interesting to see what an interactive session feels like
@rgerganov
Radoslav Gerganov
1 year
Running llama.cpp on my Pixel5 phone with termux. Kudos to @ggerganov !
Tweet media one
12
34
295
11
69
460
@ggerganov
Georgi Gerganov
1 year
I think I can make 4-bit LLaMA-65B inference run on a 64 GB M1 Pro 🤔 Speed should be somewhere around 2 tokens/sec. Is this useful for anything?
37
17
452
@ggerganov
Georgi Gerganov
1 month
GGUF My Repo by @huggingface Create quantum GGUF models fully online - quickly and securely. Thanks to @reach_vb , @pcuenq and team for creating this HF space! In the video below I give it a try to create a quantum 8-bit model of Gemma 2B - it took about
24
90
462
@ggerganov
Georgi Gerganov
9 months
ROCm support in llama.cpp. A 4-month community effort enables AMD devices to run quantum LLMs with high efficiency. Really great to see the strong collaboration in this work!
11
66
452
@ggerganov
Georgi Gerganov
6 months
Very clever stuff! Will be adding a llama.cpp example soon
@lmsysorg
lmsys.org
6 months
Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step. Blog: Code:
23
250
1K
7
39
454
@ggerganov
Georgi Gerganov
1 year
I'm color-coding Whisper tokens based on their probs -- green means confident. All models behave in a similar way (first 3 images), except for Large V2. The probs are all over the place (4th image) 🤔 Do I have a bug or is this model somehow unstable?
Tweet media one
Tweet media two
Tweet media three
Tweet media four
15
30
449
@ggerganov
Georgi Gerganov
1 year
The plan for adding full-fledged GPU support in ggml is starting to take shape. Today I finally finished the ggml computation graph export / import functionality and demonstrated basic MNIST inference on the Apple Silicon GPU using Metal
8
65
443
@ggerganov
Georgi Gerganov
1 year
4-bit integer quantisation in whisper.cpp / ggml. You can now run the Large Whisper model locally in a web page via WebAssembly SIMD
Tweet media one
11
66
441
@ggerganov
Georgi Gerganov
10 months
Very cool experiment by @chillgates_ : distributed MPI inference using llama.cpp with 6 Raspberry Pis - each one, with 8GB RAM, "sees" 1/6 of the entire 65B model. Inference starts around ~1:10. Follow the progress here:
@chillgates_
Loki (cute/acc)
10 months
Yeah. I have ChatGPT at home. Not a silly 7b model. A full-on 65B model that runs on my pi cluster, watch how the model gets loaded across the cluster with mmap and does round-robin inferencing 🫡 (10 seconds/token) (sped up 16x)
85
188
2K
12
75
437
@ggerganov
Georgi Gerganov
1 year
This is the prompt for anyone interested:
8
51
432
@ggerganov
Georgi Gerganov
10 months
napkin math ahead:
- buy 8 mac mini (200GB/s, ~$1.2k each)
- run LLAMA_METAL=1 LLAMA_MPI=1 for interleaved pipeline inference
- deploy on-premise, serve up to 8 clients in parallel at 25 t/s / 4-bit / 7B
is this cost efficient? energy wise? thanks to @stanimirovb for the idea
26
26
418
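Spelling out the same napkin math (rounded, and assuming generation is purely memory-bandwidth bound, which ignores compute and pipeline overhead):

8 \times \$1.2\text{k} \approx \$9.6\text{k total}, \qquad \frac{200\ \text{GB/s}}{\approx 4\ \text{GB (4-bit 7B)}} \approx 50\ \text{t/s ceiling per device}

So 25 t/s per client is roughly half the bandwidth-bound ceiling of a single mini - the numbers are at least in the right ballpark.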
@ggerganov
Georgi Gerganov
1 year
The new image segmentation model SAM by Meta looks extremely interesting
16
14
415
@ggerganov
Georgi Gerganov
7 months
This is LLaVA 7B v1.5 running on M2 Ultra, thanks to the amazing work of GH user monatis. I'm surprised this works so well - I downloaded a few photos from my phone and every single one was accurately described. Mind is blown!
4
38
416
@ggerganov
Georgi Gerganov
3 months
"inference on your head"
@josephsemrai
Joseph Semrai
3 months
inference on your head mistral 7b (4bit quantized) running locally on apple vision pro
38
118
1K
6
32
413
@ggerganov
Georgi Gerganov
9 months
"Wait, Georgi, how is this even possible?" you might ask. After all, the M2 Ultra only has 800GB/s bandwidth. Other people normally need 4 high-end GPUs to do this The answer is: Speculative Sampling
9
38
404
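llama.cpp ships a speculative example for exactly this: a small draft model proposes a batch of tokens and the large target model verifies them in a single pass, so the big model's weights are read far less often per generated token. A sketch with placeholder models and a guessed draft length; flag names are recalled from the example, so treat them as assumptions.

$ ./speculative -m codellama-34b-f16.gguf -md codellama-7b-q4_0.gguf \
    -p "..." -n 256 --draft 16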
@ggerganov
Georgi Gerganov
1 year
RWKV port in ggml by the community: I haven't had the chance to look at this in detail yet, but it feels great that people are picking up ggml and applying it to more and more models
3
54
388
@ggerganov
Georgi Gerganov
1 year
Here I outline a potential strategy for adding GPU support to ggml. Not sure how feasible it is yet, but it could be a fun exercise for people with GPU programming experience
6
49
385
@ggerganov
Georgi Gerganov
1 year
Powered by: ggml / whisper.cpp / llama.cpp / Core ML
STT: Whisper Small
LLM: 13B LLaMA
TTS: @elevenlabsio
The Whisper Encoder is running on the Apple Neural Engine. Everything else is optimized via ARM NEON and Apple Accelerate
10
18
370
@ggerganov
Georgi Gerganov
5 months
Playing some chess using voice. WASM whisper.cpp with a quantized tiny model + grammar sampling (by @ejones ). Runs locally in the browser. Not perfect, but I think pretty good overall! Try it here:
8
44
360
@ggerganov
Georgi Gerganov
9 months
70B home assistant running on M2 Ultra at 15 t/s. I can now cancel my ChatGPT API subscription
Tweet media one
Tweet media two
20
24
356
@ggerganov
Georgi Gerganov
1 year
To run the released model with the latest llama.cpp, use the "convert-unversioned-ggml-to-ggml" python script and apply the following patch to llama.cpp. The latest llama.cpp offers significant performance and accuracy improvements in the inference computation
Tweet media one
@andriy_mulyar
AndriyMulyar
1 year
I'm excited to announce the release of GPT4All, a 7B param language model finetuned from a curated set of 400k GPT-Turbo-3.5 assistant-style generation. We release💰800k data samples💰 for anyone to build upon and a model you can run on your laptop! Real-time Sampling on M1 Mac
161
986
7K
6
47
353
@ggerganov
Georgi Gerganov
1 year
Added a ggml example for using Cerebras-GPT. I think the sampling needs some work because I can't get it to generate coherent stuff yet. Using the quantized 6.7B model. Here are the code and usage instructions if you want to play with it:
Tweet media one
7
40
349
@ggerganov
Georgi Gerganov
1 year
So, someone just DM'd me on Twitter a patch that improves the inference time by 10% on ARM NEON (i.e. Apple Silicon). Probably more people should go there and optimise this stuff
5
26
342
@ggerganov
Georgi Gerganov
11 months
600 posts per day actually sounds great to me. Some days I do feel I waste too much time here
12
15
339
@ggerganov
Georgi Gerganov
5 months
Some performance stats for llama.cpp on A-series chips (iPhone / iPad). We are collecting benchmarks for 1B, 3B and 7B models at different quantization levels. Can be used as a reference for the expected LLM performance on these devices
5
46
332
@ggerganov
Georgi Gerganov
9 months
Code Llama 34B using Q4_K_M quantization on a MacBook
@rjghik
Roman Janusz
9 months
@abacaj Quantized version running on MBP M2 Max (llama.cpp)
5
6
74
7
34
330
@ggerganov
Georgi Gerganov
1 year
Some good old airplane-mode programming. No copilot, no voice control, no AR/VR, no AI augmentations, no cybernetic implants. Just VIM and the sunrise
Tweet media one
10
8
315
@ggerganov
Georgi Gerganov
1 year
Great write-up. The CoreML branch speeds up just the Encoder. At the same time, the master branch already has an additional ~2-3x speed-up in the Decoder thanks to recent work on llama.cpp. When we merge these 2 together, the performance will be mind-blowing
@bjnortier
Ben Nortier
1 year
Hello Transcribe 2.2 with CoreML is out, now 3x-7x faster 🚀🥳 Blog post: App Store: #OpenAI #AI #Whisper #CoreML
Tweet media one
4
14
134
9
29
320
@ggerganov
Georgi Gerganov
6 months
What is the intuition for having all the LLaMA layers be the same size instead of, let's say, increasing in size: start with a small hidden state and keep increasing it as you go through the layers? At layer 1, we have just a single token with no context - a 4096 hidden state seems too big
28
21
310
@ggerganov
Georgi Gerganov
1 year
And one more example using the macOS "say" command
10
21
313
@ggerganov
Georgi Gerganov
11 months
First results on M2 Ultra using llama.cpp
Tweet media one
7
14
304
@ggerganov
Georgi Gerganov
1 year
I'm incredibly grateful to @natfriedman and @danielgross for the support & funding, and also for helping me get even more inspired in this project. There is still a long way ahead with many ideas to try and cool things to do. Hope you will join and help us create something useful!
13
5
305
@ggerganov
Georgi Gerganov
9 months
Similar test with M2 Ultra using vanilla 70B LLaMA. Prompt processing is currently unoptimized, but just the text generation yields ~13 tok/s. This is Q4_0 quantization (~39GB) using Metal
Tweet media one
@Teknium1
Teknium (e/λ)
9 months
Llama-70B (StableBeluga2-70B) inference at home with exllama:
Tweet media one
8
10
137
13
28
297
@ggerganov
Georgi Gerganov
6 months
We've gathered some performance stats of llama.cpp on Apple Silicon. Can be useful to compare the speed across the M-series chips
Tweet media one
9
24
293
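Tables like this are typically produced with the bundled llama-bench tool, which reports prompt processing (pp) and text generation (tg) rates; a minimal sketch with a placeholder model path.

# run the default pp/tg benchmark for a given model
$ ./llama-bench -m ./models/llama-7b-q4_0.gguf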
@ggerganov
Georgi Gerganov
6 months
Don't stop there - add Whisper to talk to it
@Karmedge
Robert Lukoszko — e/acc
6 months
So I set up BakLLaVA-1 in the llama.cpp, and now it can provide real-time descriptions of the live feed from my camera check it out! open source? cc: @nisten @thursdai_pod @willdepue #llama
62
127
866
7
23
292
@ggerganov
Georgi Gerganov
2 months
Here is a thread with a (small) part of the exciting developments in llama.cpp over the past month
Tweet media one
Tweet media two
Tweet media three
6
33
287
@ggerganov
Georgi Gerganov
1 year
I will soon be looking into hiring people to work full-time on ggml, and I'm also interested in hearing from companies that want to use ggml commercially and need technical support. Contact jobs@ggml.ai and sales@ggml.ai
6
11
286
@ggerganov
Georgi Gerganov
5 months
Here is a quick demo with the 4-bit base model on M2 Ultra. You can run this on your MacBooks - the quantized models are between 15GB ~ 50GB. The prompt processing speed is subpar atm, but we will improve this
15
26
279
@ggerganov
Georgi Gerganov
6 months
Love this demo!
- Using the batched decoding API, the continuations can be computed faster
- Probably a longer context would be helpful. Could be computed in the background while typing
@dylfreed
Dylan Freedman
6 months
Prototyping a real-time AI writing tool to show how large language models are essentially probability engines. (thanks to @ggerganov 's llama.cpp for enabling this to run rapidly on an 8GB RAM MacBook Air)
16
39
399
2
34
277
@ggerganov
Georgi Gerganov
1 year
Basic MNIST inference example with ggml. Thanks to @cromwellian for the contribution
Tweet media one
Tweet media two
5
29
271