Simo Ryu

@cloneofsimo

3,776
Followers
431
Following
330
Media
1,231
Statuses

I like cats, math and codes. cloneofsimo@gmail.com

Seoul, Korea
Joined May 2022
@cloneofsimo
Simo Ryu
3 months
Friendly reminder in case y'all forgot: why is the torch + CUDA stack so incredibly popular and user friendly that it dominates the entire AI market?
Tweet media one
40
61
1K
@cloneofsimo
Simo Ryu
1 year
What we've all been waiting on for months is finally here
Tweet media one
15
148
1K
@cloneofsimo
Simo Ryu
1 month
Yes. Yes!!!! Everyone read this material three times!
@HamelHusain
Hamel Husain
1 month
Has someone created materials around “fundamentals of ML for AI Engineers”, not focused on building models but things like evaluations, error analysis, etc Maybe something already exists? I don’t want to do it lol - looking for a resource I can share with people
36
31
368
7
93
646
@cloneofsimo
Simo Ryu
1 year
INSANE new model! This just wrecked GLIGEN and sketch-guided diffusion at the same time... 🤯
Tweet media one
13
84
414
@cloneofsimo
Simo Ryu
8 months
If you want to train everything from scratch:
1. Train VAE
2. Train CLIP
3. Train LLM
4. Using 3, train captioner based on CLIP
5. Finetune dense captioner
6. Relabel text-image pairs
7. Train UNet based on 1, 6
8. Train pixel decoder
9. Train LLM for upsampling captions
Tweet media one
11
67
411
@cloneofsimo
Simo Ryu
5 months
So a year ago I introduced LoRA (which was at the time little known even to the LLM community; it was well before LLaMA / PEFT) to the image generation space. Little did I realize that a year later thousands of deepfake waifu LoRAs would be flooding the web... 🫥
@emmanuel_2m
Emm
5 months
My model is now ready to make thousands of consistent generations... It's technically known as a LoRA (Low-Rank Adaptation), with SDXL as the base (foundation) model. From here, two options are possible: (i) Utilize your LoRA model independently, (ii) Or blend this LoRA with…
Tweet media one
3
11
123
19
29
349
@cloneofsimo
Simo Ryu
1 year
This paper and their model are insane. It's highly likely that these attention layers can be transferred to other fine-tuned models as well, which is a truly groundbreaking feature for the SD community.
Tweet media one
6
59
353
@cloneofsimo
Simo Ryu
10 months
Did you know SDXL can be implemented in 520 lines of code in a single file? If you thought diffusers' UNet code is now too big to understand in an hour, and wanted a very limited but fully diffusers-compatible refactor of the SDXL UNet, this is for you
9
53
300
@cloneofsimo
Simo Ryu
1 month
Personally, I feel very good today. Achievement unlocked: successfully trained a very large diffusion model from scratch, entirely on my own codebase! (Of course, it's not like the SD3 paper's codebase is out or anything..)
12
14
282
@cloneofsimo
Simo Ryu
2 months
YES!!!! Took 26 hours to make this happen: a conditional D3PM implementation in PyTorch. Let's accelerate discrete diffusion research!!! 👏 I believe this is the only torch implementation of it out there! Less than 400 LOC! paper:
Tweet media one
Tweet media two
5
41
275
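For readers unfamiliar with D3PM, here is a minimal sketch of the forward (corruption) step for the uniform-transition-matrix variant; this is an illustrative assumption on my part, not code from the repo referenced above:

import torch

def d3pm_uniform_corrupt(x0, t, alpha_bar, num_classes):
    # Sample x_t ~ q(x_t | x_0) for D3PM with uniform transition matrices.
    # x0: integer tokens (B, ...); t: timestep indices (B,); alpha_bar: cumulative keep-probabilities (T,)
    keep_prob = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    keep = torch.rand(x0.shape, device=x0.device) < keep_prob              # keep the original token...
    uniform = torch.randint(0, num_classes, x0.shape, device=x0.device)    # ...or resample uniformly
    return torch.where(keep, x0, uniform)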
@cloneofsimo
Simo Ryu
11 months
Here is a cool little hack I found with AnimateDiff: instead of just sampling, introducing variance-preserving self-correlation along the time axis achieves "less flickering motion". corr = [0.9, 0.7, 0.2, 0.0 (just sampling)].
7
39
265
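A minimal sketch of what "variance-preserving self-correlation in the time axis" could look like: an AR(1) process over frames. The function name and shapes are illustrative assumptions, not the exact code behind the tweet above.

import torch

def correlated_noise(num_frames, shape, corr=0.7, device="cpu"):
    # Each frame: corr * previous + sqrt(1 - corr^2) * fresh noise,
    # so every frame stays ~N(0, 1) but adjacent frames are correlated.
    # corr = 0.0 recovers plain i.i.d. sampling.
    eps = torch.randn(num_frames, *shape, device=device)
    frames = [eps[0]]
    for i in range(1, num_frames):
        frames.append(corr * frames[-1] + (1.0 - corr ** 2) ** 0.5 * eps[i])
    return torch.stack(frames, dim=0)  # (num_frames, *shape)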
@cloneofsimo
Simo Ryu
4 months
So you've had your fun with @karpathy 's minGPT. Now it's time to scale: introducing min-max-gpt, a really small codebase that scales with the help of @MSFTDeepSpeed . No huggingface accelerate, no transformers. Just deepspeed + torch: maximum hackability
Tweet media one
7
35
249
@cloneofsimo
Simo Ryu
1 month
Again, the paper I'm advocating here is from OpenAI, is referenced all the time, and frankly is one of the papers every large-scale practitioner should read. The math here isn't complicated, and nothing here is either controversial or task dependent.
Tweet media one
12
20
224
@cloneofsimo
Simo Ryu
1 month
Wondered how SD3 was trained? Me too 😅, but I tried my best to replicate it today! A scalable transformer-based rectified flow, following SD3's logit-normal sampler and llama-dit architecture. Enjoy!
9
41
223
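For context, a minimal sketch of a rectified-flow training loss with SD3-style logit-normal timestep sampling. The `model` signature and the velocity convention are assumptions for illustration, not the code from the tweet above.

import torch

def rectified_flow_loss(model, x0, cond, ln_mean=0.0, ln_std=1.0):
    # x0: clean latents (B, C, H, W); cond: conditioning passed through to the model.
    b = x0.shape[0]
    t = torch.sigmoid(ln_mean + ln_std * torch.randn(b, device=x0.device))  # logit-normal t in (0, 1)
    t_ = t.view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = (1.0 - t_) * x0 + t_ * noise           # linear interpolation between data and noise
    v_pred = model(x_t, t, cond)                 # model predicts the velocity (noise - x0)
    return ((v_pred - (noise - x0)) ** 2).mean()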
@cloneofsimo
Simo Ryu
22 days
Hi, this is Lavenderflow-5.6B-v0.0
✅ MMDiT, muP, CFM, FSDP, recaptioned, 768x768, T5
✅ No strings attached, completely-open-every-step-of-the-way
✅ Not SoTA 😅 (hey, it was trained by one grad student in 3 weeks of total development.) Severely undertrained!
Tweet media one
13
32
220
@cloneofsimo
Simo Ryu
1 year
I've managed to fine-tune the Kandinsky 2.1 model. I think I'm the first one to get it done (because there is no doc in the repo, the model structure is rather strange, and it's really not trivial to fine-tune). The model itself is really good, as the FID promised.
Tweet media one
Tweet media two
16
24
219
@cloneofsimo
Simo Ryu
29 days
5.6B param SD3 replication TODO:
1. Find dudes with lots of compute: ✅
2. Check MMDiT scales up to 5B: ✅
3. Download, deduplicate 120M dataset: ✅
4. Preprocess VAE: ✅
5. (Won't do aesthetic filtering!!) ✅
6. Recaption with BLIP-3 or sth + T5 emb ✅
7. GPUs go brr LETS GOOO⌛️⌛️
Tweet media one
10
16
213
@cloneofsimo
Simo Ryu
1 year
At this point so many SD-related techs are getting pumped out that it's near impossible to catch up 🤣 Either way, here goes another ControlNet-like model from Tencent
Tweet media one
3
31
201
@cloneofsimo
Simo Ryu
1 year
I've ported t2i-adapter to be compatible with the diffusers library, go ahead and use them! Example with the Anythingv3 model + LoRA + T2I Adapter. (all with diffusers!)
Tweet media one
7
34
193
@cloneofsimo
Simo Ryu
30 days
Ok, my 5.4B freaking-absolute-overkill ImageNet-1K rectified flow model is now finished. It was trained for 320K steps with bs 128, meaning it's SIGNIFICANTLY undertrained. However, it is looking *very good* for its training budget. Also, training was very stable: zero loss spikes!
6
10
185
@cloneofsimo
Simo Ryu
2 months
Uhh, excuse me, wtf, LLAMA3 ranking 1st????? in the lmsys arena in English? Kudos to the team @AIatMeta , based AF 👏👏 for open sourcing a literal GPT-4 level model, (almost) no strings attached 🥳
Tweet media one
6
19
181
@cloneofsimo
Simo Ryu
1 year
Finally, on-par quality with Dreambooth, updated + optimized PTI CLI, SVD distillation CLI, flexible dataset and CLIP metrics utilities, wandb logging, v0.1.0 is finally out!
Tweet media one
6
28
179
@cloneofsimo
Simo Ryu
1 year
Cannot emphasize this enough, but you only have to train a LoRA once and you can apply it anywhere. The case below is with , which is a pretty awesome model. Configs from
Tweet media one
5
13
169
@cloneofsimo
Simo Ryu
23 days
Normal people's hobby: listening to music, sports, video games... Me: speedrunning pretraining of a 5B T2I DiT from scratch in under 3 weeks RELEASING SOON!!!!! (btw this is the pretrained ver, gotta train on hi-res)
8
15
171
@cloneofsimo
Simo Ryu
2 months
Did you know ImageNet fits in your Apple Watch's RAM? Introducing imagenet.int8: a 5GB, cropped, VAE'd, quantized version of ImageNet, 26x compression in total, preprocessed in StreamingDataset format. Enjoy.
Tweet media one
13
25
162
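The general idea, as a sketch: store VAE latents as int8 with a fixed scale and dequantize on load. The actual imagenet.int8 preprocessing may differ, and the clipping scale below is an assumed value.

import torch

def quantize_latent(latent, scale=8.0):
    # Map float latents (assumed to mostly lie in [-scale, scale]) to int8.
    return (latent / scale * 127.0).clamp(-128, 127).round().to(torch.int8)

def dequantize_latent(q, scale=8.0):
    # Recover an approximate float latent for training.
    return q.to(torch.float32) / 127.0 * scale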
@cloneofsimo
Simo Ryu
1 month
In an equal compute budget, using a larger batch almost always implies worse performance. The rationale for a larger batch size should always be faster convergence in equal *time*, not better performance in an equal compute budget
Tweet media one
8
14
159
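The tradeoff being referenced is usually written, roughly as in OpenAI's large-batch-training analysis, as a hyperbola between optimization steps $S$ and examples processed $E$, with $S_{\min}$, $E_{\min}$ the asymptotic minima and $B_{\mathrm{crit}}$ the critical batch size:

$$\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1, \qquad B_{\mathrm{crit}} \approx \frac{E_{\min}}{S_{\min}}$$

Raising the batch size far above $B_{\mathrm{crit}}$ still reduces the number of steps (wall-clock time, given enough GPUs), but the extra examples needed to reach the same loss mean worse performance per unit of compute.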
@cloneofsimo
Simo Ryu
1 month
SD3 replication TODO:
1. Find dudes with lots of compute: ✅
2. Check MMDiT scales up to 5B: ✅
3. Download, deduplicate 120M dataset: ✅
4. Preprocess VAE: ✅
5. Aesthetic filter with HPSv2
6. Recaption with BLIP-3 or sth + T5 emb
7. GPUs go brr -> fail multiple times
8
9
156
@cloneofsimo
Simo Ryu
3 months
But to be honest, there have been tons of low-rank, quantized gradient-approximation methods for efficient allgathers that the paper didn't mention for some reason. Like, not citing PowerSGD?? this? …Like man, totally not cool 🙄 fig from the PowerSGD paper
Tweet media one
@_akhaliq
AK
3 months
GaLore Memory-Efficient LLM Training by Gradient Low-Rank Projection Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank
Tweet media one
17
168
870
3
25
150
@cloneofsimo
Simo Ryu
1 year
It's a really cool, high-quality compressed model, especially considering they achieved this with a single A100 machine! +1 for the awesome demo page as well.
1
24
132
@cloneofsimo
Simo Ryu
1 year
I managed to get it to work! 2 steps, no progressive distillation as promised, reasonable quality for a dumb UNet structure and 10 min of training. I think this is the only implementation out there (given it's like 4 days old). Not bad!
Tweet media one
6
16
125
@cloneofsimo
Simo Ryu
28 days
The moon is high, the model is 44k steps in. I stopped the run to check on everything and move to multi-node; didn't expect *anything* at all... However, it's safe to say I've trained my FIRST ever 5.6B text-to-image MMDiT from scratch!!!
14
2
123
@cloneofsimo
Simo Ryu
10 months
My friend : "Stable diffusion's Unet is confusing" Me :
Tweet media one
4
15
121
@cloneofsimo
Simo Ryu
11 months
Fully fine-tuning SDXL on OW Kiriko images. This took about 10 min. Can you believe this is fine-tuned Base model? BASE???? @StabilityAI is simply incredible.
Tweet media one
Tweet media two
9
14
118
@cloneofsimo
Simo Ryu
28 days
"bUt iT woN't Be aS goOd wiTH yoUR teeNy coMpUte" nah i dont care im not raising cash bro, gaining this experience of handling 100M-scale dataset, pretraining billion-scale vision model from scratch, post-hoc analysis... *all as a hobby in my free time*, is what matters 😎
11
2
113
@cloneofsimo
Simo Ryu
1 year
Cool work, have a look! Interesting to see they tie the "probability" of the discrete representation to, well, the probability of the dataset: variational inference itself.
Tweet media one
1
23
109
@cloneofsimo
Simo Ryu
1 year
So this might be the current best usable form of encoder-based inversion for SD 2.X models. Really good in terms of fidelity, but the NC license is a bit sad.
Tweet media one
7
23
109
@cloneofsimo
Simo Ryu
1 month
Larger models being more sample efficient is arguably the single most important rationale behind large-scale training. LLAMA3 made us forget that.
Tweet media one
@Ethan_smith_20
Ethan
1 month
wtf
Tweet media one
12
4
77
3
13
109
@cloneofsimo
Simo Ryu
2 months
Well, well, ain't this exciting.
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
2 months
Google presents Mixture-of-Depths Dynamically allocating compute in transformer-based language models Same performance w/ a fraction of the FLOPs per forward pass
Tweet media one
6
92
618
1
9
101
@cloneofsimo
Simo Ryu
1 month
Math is,,, incredible. I just fixed the learning rate to be faithful to what muP suggests, and now the gradient norm is much more stable, my depression is cured, my eyesight has improved, my posture is better, and cancer is cured.
Tweet media one
Tweet media two
8
1
98
@cloneofsimo
Simo Ryu
1 year
This is the "real" stable diffusion moment for LLMs. Goodbye llama.
@DbrxMosaicAI
Databricks Mosaic Research
1 year
📢 Introducing MPT: a new family of open-source commercially usable LLMs from @MosaicML . Trained on 1T tokens of text+code, MPT models match and - in many ways - surpass LLaMa-7B. This release includes 4 models: MPT-Base, Instruct, Chat, & StoryWriter (🧵)
Tweet media one
22
216
1K
2
4
95
@cloneofsimo
Simo Ryu
1 year
Unlike ControlNet, T2I-Adapter is lightweight, generalizable out of the box, and very fast. It also doesn't generate additional features per timestep. However, it seems to be less strict than ControlNet, so one might prefer ControlNet for truly fine-grained control.
Tweet media one
Tweet media two
6
11
96
@cloneofsimo
Simo Ryu
4 months
How did I not know this before? Download a model from HF to a local visible directory via:
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=True
huggingface-cli download TheBloke/Yi-34B-Chat-AWQ --local-dir ./yiawq
NO JOKE, 100x speedup
3
6
97
@cloneofsimo
Simo Ryu
2 months
If you binarize MNIST and run D3PM, it is literally discrete diffusion in QR-code space lol 🤪
1
1
94
@cloneofsimo
Simo Ryu
1 month
First look at training a 0.9B IN1k model: 67k steps in, I'm already getting pretty decent quality images!! minRF is damn scalable with the help of @MSFTDeepSpeed ! 👉 [ rectified flow, muP, SDXL vae, MMDiT, cfg = 7.0!]
Tweet media one
4
5
95
@cloneofsimo
Simo Ryu
3 months
Bro casually contributes to all of AI industry...
0
0
87
@cloneofsimo
Simo Ryu
1 year
Huh, so it looks like Triton's Flash Attention is significantly faster than torch's integrated SDPA flash attention (which is much faster than naive attention). This was done on a 3070 Ti GPU
Tweet media one
4
16
81
@cloneofsimo
Simo Ryu
1 year
Cool paper from Google! Exciting idea to use multiple latents per cross attention. There might be room for correlated optimization, where some tokens being injected share multiple common embeddings, i.e., inject another common token t_s during optimization
Tweet media one
1
21
84
@cloneofsimo
Simo Ryu
10 months
Text2characters... Absolute madmen doing insane work... looks like they will be releasing code as well
3
16
84
@cloneofsimo
Simo Ryu
5 months
Recently, Karras demonstrated a post-hoc EMA method, where he was able to "simulate" an arbitrary EMA decay factor after training by saving two copies of the EMA plus some clever math. I took a deep breath to understand it, and wrote a tutorial + working example!
Tweet media one
1
12
80
@cloneofsimo
Simo Ryu
1 month
Now that my 5.4B model is stably training (pun intended), the next goal is to deduplicate wds + filter + recaption. I've done deduplication multiple times before, but here is my best attempt yet, fully following SD3's approach with SSCD embeddings. Enjoy!
Tweet media one
8
10
81
@cloneofsimo
Simo Ryu
1 year
Since the authors didn't upload the code, here is my attempt at Prompt+! (The results below are from my impl.) Also further tested the "correlated extended embedding" idea, which seems to be working (whether it is better or not is unclear)
Tweet media one
4
11
79
@cloneofsimo
Simo Ryu
1 year
Lucky enough to collaborate with @huggingface 's diffusers team (more like watching them implement 🤣 I wrote no code) and... huge updates! Now LoRA is officially integrated with diffusers! There are major differences from my implementation, and it's very simple to use!
@RisingSayak
Sayak Paul
1 year
Fine-tune Stable Diffusion in T4/V100 on a custom image-caption pairs' dataset 🧨 🔥 => memory efficiency This is enabled by LoRA. With LoRA, the fine-tuned checkpoints are just **3 MBs** in size 🤯 => portability Know about it👇
Tweet media one
2
43
284
3
15
79
@cloneofsimo
Simo Ryu
5 months
Even with 16 samples, FINE-TUNING PERFORMS SIGNIFICANTLY BETTER THAN ICL!!! Everyone, fine-tune your weights, not discrete prompts! 😋
Tweet media one
3
8
75
@cloneofsimo
Simo Ryu
29 days
Btw, this was done on the int8 quantized dataset I shared a couple weeks ago, which is 26x smaller than the original dataset!!! Imo clever dataset quantization has a lot to offer.
@cloneofsimo
Simo Ryu
30 days
Ok, my 5.4B freaking-absolute-overkill ImageNet-1K rectified flow model is now finished. It was trained for 320K steps with bs 128, meaning it's SIGNIFICANTLY undertrained. However, it is looking *very good* for its training budget. Also, training was very stable: zero loss spikes!
6
10
185
5
2
77
@cloneofsimo
Simo Ryu
1 month
Watch my compute-optimal 5.4B rectified flow model go. I don't have to say this again, but... muP just makes everything easier.
Tweet media one
4
3
75
@cloneofsimo
Simo Ryu
14 days
"Oh the bitter lesson? Yeah I love the bitter lesson!" -- gpu rich
Tweet media one
4
4
76
@cloneofsimo
Simo Ryu
1 year
New trick that works insanely well! How would one mitigate spurious correlation that occurs during fine-tuning? Identify the dataset on the region of interest! [1/n]
Tweet media one
4
10
72
@cloneofsimo
Simo Ryu
10 months
I wouldn't have come up with using LoRA for DreamBooth if I'd had a beefy A100 GPU to play around with 😂 Now even the "GPU-rich" use LoRA to fine-tune diffusion models.
@hardmaru
hardmaru
10 months
I prefer to operate in “GPU-Poor” mode. I don’t agree with the take from the semianalysis piece. Creative breakthroughs often occur under constraints—new systems, models, and methods that can better take advantage of even larger-scale compute
Tweet media one
72
135
1K
1
9
72
@cloneofsimo
Simo Ryu
1 year
Got my hands on it. Super easy to use, and some findings:
1. Works with textual inversion, custom models, and LoRA. Incredible flexibility.
2. Prompting + guidance has a non-negligible effect here.
3. Sub-second upscaling. Almost a free lunch.
Tweet media one
1
10
71
@cloneofsimo
Simo Ryu
1 year
Sometimes the very existence of the HF team is a bit... unreal. Like, imagine if we *didn't* have huggingface.
@RisingSayak
Sayak Paul
1 year
🧨 diffusers 0.17.0 is out and comes with new pipelines, improved LoRA support, `torch.compile()` speedups, and more ⏰ 🪄 UniDiffuser 🦄 DiffEdit ⚡️ IF DreamBooth 💡 Support for A1111 LoRA and more ... Release notes 📝 1/🧶
Tweet media one
6
60
303
2
3
68
@cloneofsimo
Simo Ryu
1 year
I officially graduated btw
Tweet media one
13
0
69
@cloneofsimo
Simo Ryu
16 days
Ok, great day for progressive training today. One for diffusion: train the core t2i component efficiently, freeze it, and train the first / last layers later on. One for LLMs: block expansion for a 50% speedup. Great stuff!!
Tweet media one
2
8
69
@cloneofsimo
Simo Ryu
10 months
Had such a fun time putting this on @replicatehq via Cog with @allnoteson , @daannelson , @anotherjesse ! Fine-tuning support for Dreambooth, Textual Inversion, and LoRA. CLIPSeg masking, BLIP captioning, SwinIR upscaling preprocessing! + the entire thing is open sourced.
Tweet media one
@replicate
Replicate
10 months
SDXL is the best open-source image model ever created. Now you can fine-tune it with your own images on Replicate.
5
21
134
8
6
68
@cloneofsimo
Simo Ryu
1 year
I want access to SDXL so badly... and while we don't yet have access to @StabilityAI 's latest model, SDXL, we *do* have access to the newest VAE.
5
7
64
@cloneofsimo
Simo Ryu
4 months
I think this is the first ever openly reproduced result of muP at > 1B scale, from @TheGregYang and @edwardjhu . Following the muP formula, you get to sweep on 100M-scale models and transfer successfully to a 4B model (this sweep took 3 days on 8xA100 GPUs lol)
Tweet media one
4
12
65
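A minimal sketch of the µTransfer workflow the tweet above describes, assuming the API of Microsoft's `mup` package (`MuReadout`, `set_base_shapes`, `MuAdam`); see that library's docs for the exact initialization requirements.

import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

class TinyMLP(nn.Module):
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, width), nn.ReLU(),
                                  nn.Linear(width, width), nn.ReLU())
        self.readout = MuReadout(width, d_out)   # muP-aware output layer
    def forward(self, x):
        return self.readout(self.body(x))

# base/delta models only tell mup how shapes scale with width
base, delta = TinyMLP(width=64), TinyMLP(width=128)

proxy = TinyMLP(width=256)                 # sweep the learning rate on this small proxy...
set_base_shapes(proxy, base, delta=delta)

big = TinyMLP(width=4096)                  # ...then reuse the winning lr on the big model
set_base_shapes(big, base, delta=delta)
opt = MuAdam(big.parameters(), lr=3e-4)    # lr found on the proxy transfers under muP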
@cloneofsimo
Simo Ryu
8 months
I feel like 2024 will be wild with consistency models.
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
8 months
On arXiv now, I told y'all it was from @DrYangSong ! (and from @prafdhar too)
Tweet media one
1
26
199
5
2
67
@cloneofsimo
Simo Ryu
1 year
Just released version 0.0.7! Thanks to all the contributors, now you can use different optimizers for embeddings and LoRAs, benefit from textual inversion directly, inspect LoRAs, use a better module finder, fine-tune MLPs, use safetensors, and use the trainer CLIs!
Tweet media one
3
8
64
@cloneofsimo
Simo Ryu
1 year
This project (code not released yet) is awesome, but what in the world is "regularized DDIM inversion"? Is it literally imposing a prior on the latent with a scheduled normal distribution and Bayesian updating during inversion accordingly?
Tweet media one
4
9
64
@cloneofsimo
Simo Ryu
1 month
🤔 Interesting, I will give $1000 to anyone who finds a task where a larger batch size leads to more compute-efficient optimization, i.e., where a figure like the following is *not* monotonically decreasing.
Tweet media one
Tweet media two
@A_K_Nain
Aakash Kumar Nain
1 month
True and not true. How? It depends on the task. For example, in case of supervised learning it is true for many cases (still not all), and for contrastive learning a bigger batch is always preferable. Want more nuanced take? "Bigger batch size" is very relative, and very much…
3
1
27
10
7
64
@cloneofsimo
Simo Ryu
1 month
Waking up to see this.... feels good man
Tweet media one
4
1
60
@cloneofsimo
Simo Ryu
1 year
Ever dreamed of mixing models "during sampling"? With LoRA, now you can! If you fine-tune your model too much, it loses information on other stuff, so it loses general composability. Now, you might've wished to apply model A for the first 25 steps, and model B for the later 25. [1/n]
Tweet media one
5
3
61
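A sketch of one way to do this switch today, assuming a recent diffusers version with the PEFT multi-adapter API and `callback_on_step_end` (the model ID and LoRA paths below are placeholders, not the original setup):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("path/to/lora_A", adapter_name="A")   # placeholder paths
pipe.load_lora_weights("path/to/lora_B", adapter_name="B")
pipe.set_adapters(["A"])                                     # start with model A

def switch_lora(pipeline, step_index, timestep, callback_kwargs):
    if step_index == 25:                 # hand over to LoRA B halfway through 50 steps
        pipeline.set_adapters(["B"])
    return callback_kwargs

image = pipe("a photo of a cat", num_inference_steps=50,
             callback_on_step_end=switch_lora).images[0]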
@cloneofsimo
Simo Ryu
26 days
We need a better dataloader for pytorch, one that is, in a sense, a mix of MDS from @DbrxMosaicAI , Webdataset, and SQL. We should be able to join data columns. We should be able to filter (with some sort of query language, on the fly), in an efficient distributed manner...
7
4
61
@cloneofsimo
Simo Ryu
1 month
There is just something really cool about deep learning.
Tweet media one
4
2
59
@cloneofsimo
Simo Ryu
2 months
Effective free lunch I made today! Karras EMAing once every K steps and adjusting beta accordingly is a free lunch. (+ when you do it in a cpu-offloaded fashion, this is effectively zero-cost EMA!) Code ->
Tweet media one
2
9
59
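A sketch of the trick, under the assumption that "adjusting beta" means raising the per-step decay to the K-th power so the effective decay stays the same; the function and names are illustrative, not the linked code.

import torch

@torch.no_grad()
def update_ema(ema_model, model, beta, every_k, step):
    # Update EMA weights only once every `every_k` steps; beta ** every_k keeps the
    # effective decay roughly equal to a plain per-step EMA with decay `beta`.
    if step % every_k != 0:
        return
    decay = beta ** every_k
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p.detach().to(p_ema.device), alpha=1.0 - decay)

# ema_model can live on CPU (e.g. copy.deepcopy(model).cpu().requires_grad_(False)),
# which is what makes the offloaded version nearly free on the GPU side.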
@cloneofsimo
Simo Ryu
3 months
There is no such thing as a silver bullet, and it all depends on the downstream domain. However, in many cases "nicely done" fine-tuning ALWAYS performs better than in-context learning. It's actually typically the case that LMs can be fine-tuned to perform better at ICL (MetaICL)
Tweet media one
@minimaxir
Max Woolf
3 months
Extremely hot LLM take: you will often get better results with few-shot prompting (with good examples) on a modern LLM than with a finetuned LLM. Finetuning was the best option for weaker LLMs with lower context windows: both problems have been solved nowadays.
37
36
384
4
10
58
@cloneofsimo
Simo Ryu
2 months
This is the only legitimate use case of abstract algebra within deep learning research I've ever seen, god damn it's so cool... (you know, not being one of those papers that use high-level math just for the sake of it 🙄)
Tweet media one
5
3
55
@cloneofsimo
Simo Ryu
1 year
Ok, I cannot believe this, but this actually worked: given the same number of steps, skipping 50% of the initial inversion steps (so that the 0.5 < T < 1.0 steps are finer) helps inversion significantly... Check out the code if interested
Tweet media one
@cloneofsimo
Simo Ryu
1 year
If the misalignment between x_t and x_t+1 is large at the beginning (x_T), why don't we use smaller DDIM steps at the later stage of DDIM inversion? I.e., reparametrize the scheduler to be finer close to t ~ 1?
Tweet media one
2
1
19
3
5
56
@cloneofsimo
Simo Ryu
10 months
DDIM inversion pipeline: green horse -> fantasy black horse, masterpiece, 4 K
Tweet media one
0
3
57
@cloneofsimo
Simo Ryu
1 year
The importance of visual-language models, especially CLIP 📎, is ever-growing. I know that a lot of my followers are hardcore ML engineers/researchers interested in multimodal training, so here is a set of very recent literature on faster, better-performing training of CLIPs. 🧵 [1/n]
4
6
56
@cloneofsimo
Simo Ryu
14 days
"Bounded gaps between primes" by Yitang Zhang is arguably the most important paper towards solving the twin prime conjecture... it has about 700 citations. Meanwhile, a random LLM paper that gives you one-liner prompt-engineering pro tips boosting MMLU by +3 points has 1k+ citations.
Tweet media one
4
7
59
@cloneofsimo
Simo Ryu
1 month
Just got the results!!! MMDiT 🤝 muP. Infinite width never disappoints 🫡 @TheGregYang Gradient norm: never blows up. Loss: never spikes, at any scale! Feature updates: maximal 🌊🌊 The code to reproduce this ->
Tweet media one
2
4
53
@cloneofsimo
Simo Ryu
22 days
@Birchlabs @StefanABaumann @SeunghyunSEO7 @imbue_ai Ok, this is *ONLY* the beginning. While I was broadcasting this progress on twitter, the @FAL guys reached out to me to plan on making this more powerful, and to go on and build > 8B models from scratch, using better methods, better-captioned datasets, everything! All open-sourced!
2
3
52
@cloneofsimo
Simo Ryu
1 month
Every modern large-scale ML practitioner should read the following three papers imo:
1. Scaling
2. Scaling batch size
3. Scaling in transfer
Oh! They just happen to be all from OpenAI 🤔 no wonder 🤷‍♂️
1
8
52
@cloneofsimo
Simo Ryu
1 month
I can't be the only one to have missed this, but let me speculate that GPT-3.5 / GPT-4 were trained with PowerSGD. Why? Because they turned to PowerSGD in the DALL·E 1 paper, and unless that really turned out to be good at that scale without much compromise, they simply wouldn't have done it
Tweet media one
1
7
50
@cloneofsimo
Simo Ryu
10 months
Personal update: I am delaying my masters in robotics and will be joining Naver (the largest tech company in Korea) for the next three months to do research on RLHF and build CLOVA X (a Korean LLM). I will continue to build open source stuff on t2i and do side projects!
3
0
49
@cloneofsimo
Simo Ryu
10 months
Let's profile SDXL : 1. Quite something to see that Transformer module is still largely the bottleneck
Tweet media one
3
0
49
@cloneofsimo
Simo Ryu
5 months
Here is another project I worked on for the past 3 weeks: Language reroll + FIM-LLAMA. Llama with fill-in-the-middle capability + a sleek interface; you can use vLLM for context-aware document inpainting.
3
4
49
@cloneofsimo
Simo Ryu
11 months
So SDXL works great with prompt weighting! You have different text encoders, but you prompt-weight them separately, concatenate them, and sample. (val ranging from 0.8 to -0.8)
Tweet media one
2
2
49
@cloneofsimo
Simo Ryu
13 days
Bros... just turn on `split_by_worker` and ShardList to max out IO. It will not get faster otherwise: not prefetch_factor, not num_workers, none. Idk why this worked, YOU WILL THANK ME LATER. I wasted 3 hours, so just sharing. Tell me more if you know why
Tweet media one
2
0
51
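A sketch of the kind of pipeline this refers to, assuming webdataset's DataPipeline API (the shard pattern and sample keys are placeholders):

import webdataset as wds
from torch.utils.data import DataLoader

urls = "data/shard-{000000..000999}.tar"   # placeholder shard pattern

dataset = wds.DataPipeline(
    wds.SimpleShardList(urls),   # the shard list
    wds.split_by_node,           # each rank gets a disjoint subset of shards
    wds.split_by_worker,         # each DataLoader worker gets its own shards
    wds.tarfile_to_samples(),
    wds.decode("torchrgb"),
    wds.to_tuple("jpg", "cls"),
    wds.batched(64),
)
loader = DataLoader(dataset, batch_size=None, num_workers=8)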
@cloneofsimo
Simo Ryu
5 months
Here is a small project I bashed out this weekend: "ezmup". muP is an effective weight-init scheme everyone should use. With ezmup, 3 LOC is all you need + it's model agnostic! 🔨
mup = Ezmup(width, model)
mup.change_width_as(64)
But... What *is* muP? [1/n]
Tweet media one
2
8
48
@cloneofsimo
Simo Ryu
2 months
YES. DO. USE. MUP. At. ALL. COSTS. REGARDLESS. OF. YOUR. TASK. MODEL. ETC. -> use it right now see it transfer optimal lr right now,
@arankomatsuzaki
Aran Komatsuzaki
2 months
A Large-Scale Exploration of μ-Transfer Investigates µP empirically, which works as intended for the majority of important cases, from 2M to 10B parameters, with some outliers
Tweet media one
2
25
145
1
8
47
@cloneofsimo
Simo Ryu
24 days
3 days in, I see the gradient norm sloooowly increasing. I found this to be the case with OLMo's training as well. Seriously, what's a good framework to explain this? Is this edge-of-stability happening IRL? 🤔🤔
Tweet media one
7
2
48
@cloneofsimo
Simo Ryu
11 months
SDXL 1.0 works very well with full fine tuning + textual inversion!
@anotherjesse
anotherjesse
11 months
We are already having lots of fun exploring fine-tuning of #sdxl on @replicatehq This WIP by @cloneofsimo really captures the essence of @zeke at work Looking forward to both enabling both fine-tuning and LoRA
Tweet media one
1
5
40
3
4
46
@cloneofsimo
Simo Ryu
1 year
That is some gigachad move right there, scaling GANs right when everyone else is turning their heads to diffusion models
@_akhaliq
AK
1 year
Scaling up GANs for Text-to-Image Synthesis present our 1B-parameter GigaGAN, achieving lower FID than Stable Diffusion v1.5, DALL·E 2, and Parti-750M. It generates 512px outputs at 0.13s, orders of magnitude faster than diffusion and autoregressive …
40
293
1K
2
5
46
@cloneofsimo
Simo Ryu
1 year
Another example: the Mo-di model from @Nitrosocke , distilled. Prompt: "modern disney style, cute baby lion". The updated distillation will be available in v0.1.2!
Tweet media one
1
4
46
@cloneofsimo
Simo Ryu
1 year
The true beauty of LoRA is that you can train one on v1.5 and apply it to any SD model (of course, given they are similar enough). I applied the Wednesday LoRA to @Nitrosocke 's redshift diffusion to get the following. So... Wednesday smiles?
Tweet media one
3
4
43