Has anyone created materials around "fundamentals of ML for AI engineers", not focused on building models but on things like evaluations, error analysis, etc.?
Maybe something already exists? I don’t want to do it lol - looking for a resource I can share with people
If you want to train everything from scratch:
1. Train a VAE
2. Train CLIP
3. Train an LLM
4. Using 3, train a captioner based on CLIP
5. Fine-tune a dense captioner
6. Relabel the text-image pairs
7. Train a UNet based on 1 and 6
8. Train a pixel decoder
9. Train an LLM to upsample captions
So a year ago I introduced LoRA (which was at the time little known even to the LLM community; it was well before LLaMA / PEFT) to the image generation space.
Little did I realize that a year later thousands of deepfake waifu LoRAs would be flooding the web... 🫥
My model is now ready to make thousands of consistent generations...
It's technically known as a LoRA (Low-Rank Adaptation), with SDXL as the base (foundation) model.
From here, two options are possible:
(i) Utilize your LoRA model independently,
(ii) Or blend this LoRA with…
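For the unfamiliar, here's a minimal sketch of what a LoRA module is, assuming a plain nn.Linear (real SDXL LoRAs patch the attention projections, and the init/scale conventions vary):

```python
import torch
import torch.nn as nn

# Minimal LoRA sketch: the base weight stays frozen; only the low-rank
# factors A and B train, so effectively W = W0 + (alpha / r) * B @ A.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen base-model weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init => no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```

Option (ii) works because the update is just a weight delta: you can merge scale * B @ A into the base weights of any sufficiently similar checkpoint.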
This paper and their model are insane. It's highly likely that these attention layers can be transferred to other fine-tuned models as well, which is a truly groundbreaking feature for the SD community.
Did you know SDXL can be implemented in 520 lines of code in a single file?
If you thought diffusers' UNet code is now too big to understand in an hour, and wanted a very limited but fully diffusers-compatible refactor of the SDXL UNet, this is for you.
Personally, I feel very good today.
Achievement Unlocked: successfully trained a very large diffusion model from scratch, entirely on my own codebase! (of course, it's not like the SD3 paper's codebase is out or anything..)
YES!!!! TOOK 26 hours to make this happen: a conditional D3PM implementation in PyTorch. Let's accelerate discrete diffusion research!!! 👏 I believe this is the only torch implementation of it out there!
Less than 400 LOC!
paper:
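For a flavor of what D3PM's forward (corruption) process looks like with uniform transitions, here is a hedged sketch; K and the beta schedule are illustrative, not the repo's actual settings:

```python
import torch

K, T = 256, 1000                               # number of discrete states, steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # prob. a token survives up to step t

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    # Uniform-transition D3PM marginal: keep each token with prob alpha_bar[t],
    # otherwise resample it uniformly from the K states.
    keep = torch.rand(x0.shape) < alpha_bar[t]
    return torch.where(keep, x0, torch.randint(0, K, x0.shape))
```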
Here is a cool little hack I found with AnimateDiff: instead of just sampling noise independently, introducing variance-preserving self-correlation along the time axis gives you "less flickering" motion. corr = [0.9, 0.7, 0.2, 0.0 (just sampling)].
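A sketch of what I mean, assuming latents shaped (frames, ...): an AR(1) mix along the frame axis, where the sqrt(1 - corr^2) factor keeps each frame's noise at unit variance:

```python
import torch

def correlated_noise(n_frames: int, shape: tuple, corr: float = 0.9) -> torch.Tensor:
    eps = [torch.randn(shape)]
    for _ in range(n_frames - 1):
        fresh = torch.randn(shape)
        # corr**2 + (1 - corr**2) = 1, so variance is preserved frame to frame
        eps.append(corr * eps[-1] + (1.0 - corr ** 2) ** 0.5 * fresh)
    return torch.stack(eps)   # corr = 0.0 recovers plain i.i.d. sampling
```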
So you've had your fun with
@karpathy
's minGPT. Now it's time to scale: introducing min-max-gpt, a really small codebase that scales with the help of
@MSFTDeepSpeed
. No huggingface accelerate, no transformers. Just deepspeed + torch: maximum hackability
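The skeleton of that recipe looks roughly like this (config values are illustrative; min-max-gpt's actual wiring will differ):

```python
import torch
import deepspeed

model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16)  # stand-in model
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 2},     # shard gradients + optimizer states
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4}},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# per step: compute loss, then engine.backward(loss); engine.step()
```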
Again, the paper I'm advocating here is from OpenAI; it is referenced all the time and is frankly one of the papers every large-scale practitioner should read. The math here isn't complicated, and nothing here is controversial or task-dependent.
Wondered how SD3 was trained? Me too 😅, but I tried my best to replicate it today!
A scalable transformer-based rectified flow, following SD3's logit-normal sampler and a LLaMA-DiT architecture.
Enjoy!
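The core objective, as I read the SD3 paper (a sketch, not the exact repo code): sample t from a logit-normal, interpolate linearly between data and noise, and regress the velocity:

```python
import torch

def rectified_flow_loss(model, x0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    # logit-normal timestep sampler (SD3 default: logits ~ N(0, 1))
    t = torch.sigmoid(torch.randn(x0.shape[0], device=x0.device))
    t_ = t.view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = (1 - t_) * x0 + t_ * noise           # linear (rectified flow) path
    v_pred = model(xt, t, cond)               # model predicts the velocity field
    return ((v_pred - (noise - x0)) ** 2).mean()
```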
Hi, this is Lavenderflow-5.6B-v0.0
✅ MMDiT, muP, CFM, FSDP, recaptioned, 768x768, T5
✅ No strings attached, completely-open-every-step-of-the-way
✅ Not SoTA 😅 (hey, it was trained by one grad student in a total of 3 weeks of development.) Severely undertrained!
I've managed to fine-tune the Kandinsky 2.1 model. I think I'm the first one to get it done (there are no docs on the repo, the model structure is rather strange, and it's really not trivial to fine-tune). The model itself is really good, as its FID promised.
At this point so many SD-related techniques are getting pumped out it's near impossible to catch up 🤣 Either way, here goes another ControlNet-like model, from Tencent.
I've ported T2I-Adapter to be compatible with the diffusers library, go ahead and use it! Example with the Anything v3 model + LoRA + T2I-Adapter. (all with diffusers!)
Ok, my 5.4B freaking-absolute-overkill ImageNet-1K rectified flow model is now finished. This was trained for 320K steps with bs 128, meaning it's SIGNIFICANTLY undertrained. However, it is looking *very good* for its training budget. Also, training was very stable: 0 loss spikes!
Uhh excuse me wtf, LLaMA 3 ranking 1st????? in the lmsys arena in English? Kudos to the team
@AIatMeta
, based AF 👏👏 for open sourcing a literal GPT-4 level model, (almost) no strings attached 🥳
Cannot emphasize this enough, but you only have to train a LoRA once and you can apply it anywhere. The case below is with , which is a pretty awesome model. Configs from
Normal people's hobby: listening to music, sports, video games...
Me: speedrunning pretraining a 5B T2I DiT from scratch in under 3 weeks
RELEASING SOON!!!!! (btw this is the pretrained ver, gotta train on hi-res)
Did you know ImageNet fits in your Apple Watch's RAM?
Introducing imagenet.int8: a 5GB, cropped, VAE'd, quantized version of ImageNet, 26x compression in total, preprocessed in StreamingDataset format.
Enjoy.
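The gist of the quantization, as a hedged sketch (the actual scale constants in imagenet.int8 may differ): SD-VAE latents are roughly bounded, so a linear map into int8 loses very little:

```python
import torch

def quantize(latents: torch.Tensor, scale: float = 8.0) -> torch.Tensor:
    # latents assumed to lie roughly in [-scale, scale]
    return (latents / scale * 127.0).round().clamp(-128, 127).to(torch.int8)

def dequantize(q: torch.Tensor, scale: float = 8.0) -> torch.Tensor:
    return q.float() / 127.0 * scale
```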
At an equal compute budget, using a larger batch almost always implies worse performance.
The rationale for using a larger batch size should always be faster convergence in equal *time*, not better performance at an equal compute budget.
But to be honest, there have been tons of low-rank, quantized gradient-approximation methods for efficient allgathers that the paper didn't mention for some reason. Like, not citing PowerSGD?? Or this? …Like, man, totally not cool 🙄
(fig from the PowerSGD paper)
GaLore
Memory-Efficient LLM Training by Gradient Low-Rank Projection
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank
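For context, the core of PowerSGD (Vogels et al.) fits in a few lines: communicate two skinny matrices instead of the full gradient. This sketch omits the error-feedback buffer and the Q warm-starting that the real algorithm depends on:

```python
import torch

def compress(grad: torch.Tensor, r: int = 4):
    m, n = grad.shape
    q = torch.randn(n, r)            # the real method reuses q from the last step
    p = grad @ q                     # (m, r)
    p, _ = torch.linalg.qr(p)        # orthonormalize the column basis
    q = grad.t() @ p                 # (n, r); all-reduce p and q, not grad
    return p, q

def decompress(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    return p @ q.t()                 # rank-r approximation of the gradient
```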
I managed to get it to work! 2 steps, no progressive distillation, as promised; reasonable quality for a dumb UNet structure and 10 min of training. I think this is the only implementation out there (given it's like 4 days old). Not bad!
The moon is high, the model is 44k steps in. I stopped the run to check on everything and to switch to multi-node; I didn't expect *anything* at all.... However, safe to say, I've trained my FIRST ever 5.6B text2image MMDiT from scratch!!!
Fully fine-tuning SDXL on OW Kiriko images. This took about 10 min. Can you believe this is the fine-tuned Base model? BASE????
@StabilityAI
is simply incredible.
"bUt iT woN't Be aS goOd wiTH yoUR teeNy coMpUte"
nah I don't care, I'm not raising cash bro. Gaining this experience of handling a 100M-scale dataset, pretraining a billion-scale vision model from scratch, post-hoc analysis... *all as a hobby in my free time*, is what matters 😎
Cool work, have a look! Interesting to see they tie the "probability" of the discrete representation to, well, the probability of the dataset: variational inference itself.
So this might be the current best usable form of encoder-based inversion for SD 2.X models. Really good in terms of fidelity, but the NC license is a bit sad.
Google presents Mixture-of-Depths
Dynamically allocating compute in transformer-based language models
Same performance w/ a fraction of the FLOPs per forward pass
Math is,,, incredible. I just fixed the learning rate to be faithful to what muP suggested; now the gradient norm is much more stable, my depression is cured, my eyesight has improved, my posture is better, and cancer is cured.
📢 Introducing MPT: a new family of open-source commercially usable LLMs from
@MosaicML
. Trained on 1T tokens of text+code, MPT models match and - in many ways - surpass LLaMa-7B. This release includes 4 models: MPT-Base, Instruct, Chat, & StoryWriter (🧵)
Unlike ControlNet, T2I-Adapter is lightweight, generalizable out-of-the-box, and very fast. It also doesn't need to generate additional features per-timestep. However, it seems to be less strict than ControlNet, so one might prefer ControlNet for truly fine-grained control.
How did I not know this before? Download models from HF to a local visible directory via:
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=True
huggingface-cli download TheBloke/Yi-34B-Chat-AWQ --local-dir ./yiawq
NO JOKE 100x speedup
First look at training a 0.9B IN1k model: 67k steps in, I'm already getting pretty decent quality images!! minRF is damn scalable with the help of
@MSFTDeepSpeed
!
👉
[ rectified flow, muP, SDXL vae, MMDiT, cfg = 7.0!]
Huh, so it looks like Triton's Flash Attention is significantly faster than torch's integrated SDPA flash attention (which is much faster than naive attention). This was done on a 3070 Ti GPU.
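For anyone who wants to reproduce the comparison, you can pin SDPA to one backend at a time (this uses torch.backends.cuda.sdp_kernel; newer torch versions expose the same thing as torch.nn.attention.sdpa_kernel):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.half)
           for _ in range(3))

# force the flash backend only; flip the flags to time the other backends
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
```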
Cool paper from Google!
An exciting idea to use multiple latents per cross-attention. There might be room for correlated optimization, where some tokens being injected share multiple common embeddings, i.e., inject another common token t_s during optimization.
Recently, Karras demonstrated a post-hoc EMA method, where he was able to "simulate" an arbitrary EMA decay factor after training by saving two copies of the EMA and some clever math.
I took a deep breath to understand it, and wrote a tutorial + working example!
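To give the barest intuition (the tutorial has the real thing): Karras et al. track EMA snapshots under two different decay profiles and solve for combination weights afterwards. Collapsed to its simplest possible form, assuming plain exponential decays and a single mixing weight:

```python
import torch

class TwoEma:
    def __init__(self, params, beta1: float = 0.99, beta2: float = 0.999):
        self.beta1, self.beta2 = beta1, beta2
        self.ema1 = [p.detach().clone() for p in params]
        self.ema2 = [p.detach().clone() for p in params]

    @torch.no_grad()
    def update(self, params):
        for e1, e2, p in zip(self.ema1, self.ema2, params):
            e1.lerp_(p, 1 - self.beta1)   # e1 <- beta1 * e1 + (1 - beta1) * p
            e2.lerp_(p, 1 - self.beta2)

    def posthoc(self, w: float):
        # after training, linearly combine the two profiles to approximate
        # an EMA decay you never actually ran
        return [w * e1 + (1 - w) * e2 for e1, e2 in zip(self.ema1, self.ema2)]
```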
Now that my 5.4B model is stably training (pun intended), the next goal is to deduplicate the wds + filter + recaption.
I've done deduplication multiple times before, but here is my best attempt yet, fully following SD3's approach with SSCD embeddings.
Enjoy!
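The dedup step itself is conceptually simple once you have the embeddings; here is a naive O(N^2) sketch, where `embs` is assumed to be an (N, D) tensor of L2-normalized SSCD-style embeddings and the threshold is a guess:

```python
import torch

def dedup_indices(embs: torch.Tensor, threshold: float = 0.9) -> list[int]:
    keep, kept = [], []
    for i, e in enumerate(embs):
        # cosine similarity against everything we've kept so far
        if kept and (torch.stack(kept) @ e).max() > threshold:
            continue                  # near-duplicate, drop it
        kept.append(e)
        keep.append(i)
    return keep
```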
Since the authors didn't upload the code, here is my attempt at Prompt+! (the results below are from my impl).
Also further tested the "correlated extended embedding" idea, which seems to be working (whether it is better or not is unclear).
Lucky enough to collaborate with
@huggingface
's diffusers team (more like watching them implement 🤣 I wrote no code) and... huge updates! Now LoRA is officially integrated with diffusers! There are major differences from my implementation, and it's very simple to use!
Fine-tune Stable Diffusion in a T4/V100 on a custom image-caption dataset 🧨 🔥 => memory efficiency
This is enabled by LoRA. With LoRA, the fine-tuned checkpoints are just **3 MBs** in size 🤯 => portability
Learn about it 👇
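Usage is a couple of lines (a sketch; the LoRA path is a placeholder, and older diffusers versions used pipe.unet.load_attn_procs instead of load_lora_weights):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/lora")   # the ~3 MB checkpoint
image = pipe("a photo of sks dog").images[0]
```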
Btw, this was done on the int8 quantized dataset I shared a couple weeks ago, which is 26x smaller than the original dataset!!! Imo clever dataset quantization has a lot to offer.
New trick that works insanely well! How would one mitigate the spurious correlations that occur during fine-tuning? Identify the dataset regions of interest! [1/n]
I wouldn't have come up with using LoRA for DreamBooth if I'd had a beefy A100 GPU to play around with 😂 Now even the "GPU-rich" use LoRA to fine-tune diffusion models.
I prefer to operate in “GPU-Poor” mode.
I don't agree with the take from the semianalysis piece. Creative breakthroughs often occur under constraints: new systems, models, and methods that can better take advantage of even larger-scale compute.
Got my hands on it. Super easy to use, and some findings:
1. Works with Textual Inversion, custom models, and LoRA. Incredible flexibility
2. Prompting + guidance has a non-negligible effect here.
3. Sub-second upscaling. Almost a free lunch.
🧨 diffusers 0.17.0 is out and comes with new pipelines, improved LoRA support, `torch.compile()` speedups, and more ⏰
🪄 UniDiffuser
🦄 DiffEdit
⚡️ IF DreamBooth
💡 Support for A1111 LoRA
and more ...
Release notes 📝
1/🧶
Ok, great day for progressive training today:
One for diffusion: train the core t2i component efficiently, freeze it, and train the first / last layers later on
One for LLMs: block expansion for a 50% speedup.
Great stuff!!
Had such a fun time putting this on
@replicatehq
via Cog with
@allnoteson
,
@daannelson
,
@anotherjesse
! Fine-tuning support for all of DreamBooth, Textual Inversion, and LoRA. CLIPSeg masking, BLIP captioning, and SwinIR upscaling preprocessing! + the entire thing is open sourced.
I think these are the first ever openly reproduced muP results at > 1B scale, of
@TheGregYang
and
@edwardjhu
. Following the muP formula, you get to sweep on 100M-scale models and transfer successfully to a 4B model (this sweep took 3 days on 8xA100 GPUs lol)
Just released version 0.0.7! Thanks to all the contributors, now you can use different optimizers for embeddings and LoRAs, benefit from textual inversion directly, inspect LoRAs, enjoy a better module finder, fine-tune MLPs, use safetensors, and use the trainer CLIs!
This project (code not released yet) is awesome, but what in the world is "regularized DDIM inversion"? Is it literally imposing a prior on the latent with a scheduled normal distribution and updating it Bayesian-style during inversion?
🤔 Interesting: I will give $1000 to anyone who finds a task where a larger batch size leads to more compute-efficient optimization, i.e., where a figure like the following is *not* monotonically decreasing.
True and not true. How? It depends on the task. For example, in the case of supervised learning it is true in many cases (still not all), and for contrastive learning a bigger batch is always preferable.
Want a more nuanced take? "Bigger batch size" is very relative, and very much…
Ever dreamed of mixing models "during sampling"? With LoRA, now you can!
If you fine-tune your model too much, it loses information about other stuff, so it loses general composability. Now, you might've wished to apply model A on the first 25 steps and model B on the later 25, as in the toy sketch below. [1/n]
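A toy, self-contained sketch of the idea (every name here is made up: a "model" is just a frozen weight plus a merged low-rank delta, and the "denoising step" is a stand-in):

```python
import torch

W0 = torch.randn(64, 64)                                    # frozen base weight
delta_A = torch.randn(64, 4) @ torch.randn(4, 64) * 0.01    # merged LoRA A
delta_B = torch.randn(64, 4) @ torch.randn(4, 64) * 0.01    # merged LoRA B

x = torch.randn(1, 64)                                      # stand-in latent
for step in range(50):
    W = W0 + (delta_A if step < 25 else delta_B)            # model A first, then B
    x = x - 0.01 * (x @ W.t())                              # stand-in denoising step
```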
We need a better dataloader for pytorch, one that is in a sense a mix of the MDS of
@DbrxMosaicAI
, WebDataset, and SQL.
We should be able to join data columns. We should be able to filter (some sort of query language on the fly), in an efficient, distributed manner...
Effective free lunch I made today! Karras EMAing once every K steps and adjusting beta accordingly is a free lunch. (+ when you do it in a CPU-offloaded fashion, this is effectively zero-cost EMA!)
Code ->
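The linked code has the real thing; as a rough sketch of the adjustment (my formulation: to emulate a per-step decay beta while only updating every K steps, use beta**K, and keep the shadow weights on CPU so the GPU barely notices):

```python
import torch

class LazyEma:
    def __init__(self, params, beta: float = 0.9999, k: int = 10):
        self.beta_k, self.k, self.step = beta ** k, k, 0    # adjusted decay
        self.shadow = [p.detach().to("cpu", copy=True) for p in params]

    @torch.no_grad()
    def update(self, params):
        self.step += 1
        if self.step % self.k:                # only touch the EMA every k steps
            return
        for s, p in zip(self.shadow, params):
            s.lerp_(p.detach().cpu(), 1 - self.beta_k)
```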
There is no such thing as a silver bullet, and it all depends on the downstream domain. However, in many cases "nicely done" fine-tuning ALWAYS performs better than in-context learning. It's actually typically the case that LMs can be fine-tuned to perform better at ICL (MetaICL).
Extremely hot LLM take: you will often get better results with few-shot prompting (with good examples) on a modern LLM than with a fine-tuned LLM.
Fine-tuning was the best option for weaker LLMs with smaller context windows: both problems have been solved nowadays.
This is the only legitimate use case of abstract algebra within deep learning research I've ever seen. God damn, it's so cool.... (you know, not being one of those papers that use high-level math just for the sake of it 🙄)
Ok, I cannot believe this, but this actually worked: given the same number of steps, skipping 50% of the initial inversion steps (so that the 0.5 < t < 1.0 steps are finer) helps inversion significantly... Check the code out if interested.
If the misalignment between x_t and x_{t+1} is large at the beginning (x_T), why don't we use smaller DDIM steps at the later stage of DDIM inversion? i.e., reparametrize the scheduler to be finer close to t ~ 1?
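In schedule terms, the trick is just this (a sketch of my reading: spend the whole step budget on the late region instead of spreading it over [0, 1]):

```python
import torch

def inversion_times(n_steps: int, t_min: float = 0.5) -> torch.Tensor:
    # all n_steps land in (t_min, 1.0], so steps near t ~ 1 are finer
    return torch.linspace(t_min, 1.0, n_steps)   # vs. torch.linspace(0.0, 1.0, n_steps)
```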
The importance of vision-language models, especially CLIP 📎, is ever-growing. I know that a lot of my followers are hardcore ML engineers/researchers interested in multimodal training, so here is a set of very recent literature on faster, better-performing training of CLIPs. 🧵 [1/n]
"Bounded gaps between primes" by Yitang Zhang is arguably the most important paper towards solving the twin prime conjecture... and it has about 700 citations.
Meanwhile, a random LLM paper that gives you one-liner prompt-engineering pro tips boosting +3 points on MMLU gets 1k+ citations.
Just got the results!!! MMDiT 🤝muP.
infinite width never disappoints
🫡
@TheGregYang
Gradient norm: never blows up,
Loss: never spikes, at any scale!
Feature updates: Maximal🌊🌊
The code to reproduce this ->
@Birchlabs
@StefanABaumann
@SeunghyunSEO7
@imbue_ai
Ok, this is *ONLY* the beginning. While I was broadcasting this progress on twitter, the
@FAL
guys reached out to me to plan on making this more powerful, and to go on and build > 8B models from scratch, using better methods, better-captioned datasets, everything! All open-sourced!
Every modern large-scale ML practitioner should read the following three papers imo:
1. Scaling
2. Scaling batch size
3. Scaling in transfer
Oh! They just happen to all be from OpenAI 🤔 no wonder 🤷♂️
I can't be the only one to have missed this, but let me speculate that GPT-3.5 / GPT-4 were trained with PowerSGD.
Why? Because they turned to PowerSGD in the DALL·E 1 paper, and unless it really turned out to be good at that scale without many compromises, they simply wouldn't have done it.
Personal update: I am
delaying my master's in robotics and will be joining Naver (it's the largest tech company in Korea) for the next three months to do research on RLHF and build CLOVA X (a Korean LLM).
I will continue to build open source stuff on t2i and do side projects!
Here is another project I worked on for the past 3 weeks: Language reroll + FIM-LLAMA, a LLaMA with fill-in-the-middle capability + a sleek interface; you can use vLLM for context-aware document inpainting.
So SDXL works great with prompt weighting!
You have two different text encoders: prompt-weight them separately, concatenate the embeddings, and sample, as in the toy sketch below.
(values ranging from 0.8 to -0.8)
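A toy sketch of the mechanics (shapes mimic SDXL's CLIP-L + OpenCLIP-G encoders, but the tensors and the (1 + w) weighting convention here are stand-ins, not the actual pipeline code):

```python
import torch

emb_l = torch.randn(1, 77, 768)     # stand-in for text-encoder-1 token embeddings
emb_g = torch.randn(1, 77, 1280)    # stand-in for text-encoder-2 token embeddings
w = torch.zeros(1, 77, 1)           # per-token weight offsets, e.g. in [-0.8, 0.8]
w[:, 5] = 0.8                       # emphasize token 5

# weight each encoder's output separately, then concatenate along channels
weighted = torch.cat([emb_l * (1 + w), emb_g * (1 + w)], dim=-1)   # (1, 77, 2048)
```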
bros... just turn on `split_by_worker` and ShardList to max out IO.
It will not get faster otherwise: not prefetch_factor, not num_workers, nothing.
idk why this worked, YOU WILL THANK ME LATER. I wasted 3 hours so I'm just sharing. Tell me more if you know why
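In code, the tip looks like this (the shard pattern is a placeholder; `split_by_worker` is a real webdataset helper that gives each dataloader worker its own distinct .tar shards):

```python
import webdataset as wds

dataset = wds.DataPipeline(
    wds.SimpleShardList("shards/data-{000000..000999}.tar"),
    wds.split_by_worker,           # each worker streams different shards => IO maxed
    wds.tarfile_to_samples(),
    wds.decode("pil"),
)
```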
Here is a small project I bashed out this weekend: "ezmup"
muP is an effective weight-init scheme everyone should use. With ezmup, 3 LOC is all you need + it's model-agnostic! 🔨
mup = Ezmup(width, model)
mup.change_width_as(64)
But... What *is* muP ? [1/n]
A Large-Scale Exploration of μ-Transfer
Investigates µP empirically, which works as intended for the majority of important cases, from 2M to 10B parameters, with some outliers
3 days in, I see the gradient norm sloooowly increasing. I found this to be the case with OLMo's training as well. Seriously, what's a good framework to explain this? Is this edge-of-stability happening IRL? 🤔🤔
We are already having lots of fun exploring fine-tuning of
#sdxl
on
@replicatehq
This WIP by
@cloneofsimo
really captures the essence of
@zeke
at work
Looking forward to enabling both fine-tuning and LoRA
Scaling up GANs for Text-to-Image Synthesis
We present our 1B-parameter GigaGAN, achieving lower FID than Stable Diffusion v1.5, DALL·E 2, and Parti-750M. It generates 512px outputs at 0.13s, orders of magnitude faster than diffusion and autoregressive …
Another example: the Mo-di model from
@Nitrosocke
, distilled.
prompt: "modern disney style, cute baby lion"
The updated distillation will be available in v0.1.2!
The true beauty of LoRA is that you can train one with v1.5 and apply it to any SD model (of course, given they are similar enough). I applied a Wednesday LoRA to
@Nitrosocke
's Redshift Diffusion to get the following:
So... Wednesday smiles?