Has anyone created materials around "fundamentals of ML for AI engineers", not focused on building models but on things like evaluations, error analysis, etc.?
Maybe something already exists? I don’t want to do it lol - looking for a resource I can share with people
If you want to train everything from scratch:
1. Train a VAE
2. Train CLIP
3. Train an LLM
4. Using 3, train a captioner based on CLIP
5. Fine-tune a dense captioner
6. Relabel the text-image pairs
7. Train a UNet based on 1 and 6
8. Train a pixel decoder
9. Train an LLM to upsample captions
So a year ago I introduced LoRA (which was at the time little known even to the LLM community; it was well before LLaMA / PEFT) to the image generation space.
Little did I realize that a year later thousands of deepfake waifu LoRAs would be flooding the web... 🫥
My model is now ready to make thousands of consistent generations...
It's technically known as a LoRA (Low-Rank Adaptation), with SDXL as the base (foundation) model.
From here, two options are possible:
(i) Utilize your LoRA model independently,
(ii) Or blend this LoRA with…
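For the unfamiliar, here's a minimal sketch of what a LoRA module is, assuming a plain nn.Linear (real SDXL LoRAs patch the attention projections, and the init/scale conventions vary):

```python
import torch
import torch.nn as nn

# Minimal LoRA sketch: the base weight stays frozen; only the low-rank
# factors A and B train, so effectively W = W0 + (alpha / r) * B @ A.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen base-model weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init => no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```

Option (ii) works because the update is just a weight delta: you can merge scale * B @ A into the base weights of any sufficiently similar checkpoint.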
This paper and their model are insane. It's highly likely that these attention layers can be transferred to other fine-tuned models as well, which is a truly groundbreaking feature for the SD community.
Did you know SDXL can be implemented in 520 lines of code in a single file?
If you thought diffusers' UNet code is now too big to understand in an hour, and wanted a very limited but fully diffusers-compatible refactor of the SDXL UNet, this is for you.
Personally, I feel very good today.
Achievement Unlocked: successfully trained a very large diffusion model from scratch, entirely on my own codebase! (of course, it's not like the SD3 paper's codebase is out or anything..)
YES!!!! TOOK 26 hours to make this happen: a conditional D3PM implementation in PyTorch. Let's accelerate discrete diffusion research!!! 👏 I believe this is the only torch implementation of it out there!
Less than 400 LOC!
paper:
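For a flavor of what D3PM's forward (corruption) process looks like with uniform transitions, here is a hedged sketch; K and the beta schedule are illustrative, not the repo's actual settings:

```python
import torch

K, T = 256, 1000                               # number of discrete states, steps (illustrative)
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # prob. a token survives up to step t

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    # Uniform-transition D3PM marginal: keep each token with prob alpha_bar[t],
    # otherwise resample it uniformly from the K states.
    keep = torch.rand(x0.shape) < alpha_bar[t]
    return torch.where(keep, x0, torch.randint(0, K, x0.shape))
```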
Here is a cool little hack I found with AnimateDiff: instead of just sampling noise independently, introducing variance-preserving self-correlation along the time axis gives you "less flickering" motion. corr = [0.9, 0.7, 0.2, 0.0 (just sampling)].
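A sketch of what I mean, assuming latents shaped (frames, ...): an AR(1) mix along the frame axis, where the sqrt(1 - corr^2) factor keeps each frame's noise at unit variance:

```python
import torch

def correlated_noise(n_frames: int, shape: tuple, corr: float = 0.9) -> torch.Tensor:
    eps = [torch.randn(shape)]
    for _ in range(n_frames - 1):
        fresh = torch.randn(shape)
        # corr**2 + (1 - corr**2) = 1, so variance is preserved frame to frame
        eps.append(corr * eps[-1] + (1.0 - corr ** 2) ** 0.5 * fresh)
    return torch.stack(eps)   # corr = 0.0 recovers plain i.i.d. sampling
```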
So you've had your fun with
@karpathy
's minGPT. Now it's time to scale: introducing min-max-gpt, a really small codebase that scales with the help of
@MSFTDeepSpeed
. No huggingface accelerate, no transformers. Just deepspeed + torch: maximum hackability
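The skeleton of that recipe looks roughly like this (config values are illustrative; min-max-gpt's actual wiring will differ):

```python
import torch
import deepspeed

model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16)  # stand-in model
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 2},     # shard gradients + optimizer states
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4}},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# per step: compute loss, then engine.backward(loss); engine.step()
```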
Again, the paper I'm advocating here is from OpenAI; it is referenced all the time and is frankly one of the papers every large-scale practitioner should read. The math here isn't complicated, and nothing here is controversial or task-dependent.
Wondered how SD3 was trained? Me too 😅, but I tried my best to replicate it today!
A scalable transformer-based rectified flow, following SD3's logit-normal sampler and a LLaMA-DiT architecture.
Enjoy!
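The core objective, as I read the SD3 paper (a sketch, not the exact repo code): sample t from a logit-normal, interpolate linearly between data and noise, and regress the velocity:

```python
import torch

def rectified_flow_loss(model, x0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    # logit-normal timestep sampler (SD3 default: logits ~ N(0, 1))
    t = torch.sigmoid(torch.randn(x0.shape[0], device=x0.device))
    t_ = t.view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = (1 - t_) * x0 + t_ * noise           # linear (rectified flow) path
    v_pred = model(xt, t, cond)               # model predicts the velocity field
    return ((v_pred - (noise - x0)) ** 2).mean()
```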
Hi, this is Lavenderflow-5.6B-v0.0
✅ MMDiT, muP, CFM, FSDP, recaptioned, 768x768, T5
✅ No strings attached, completely-open-every-step-of-the-way
✅ Not SoTA 😅 (hey, it was trained by one grad student in a total of 3 weeks of development.) Severely undertrained!
I've managed to fine-tune the Kandinsky 2.1 model. I think I'm the first one to get it done (there are no docs on the repo, the model structure is rather strange, and it's really not trivial to fine-tune). The model itself is really good, as its FID promised.
At this point so many SD-related techniques are getting pumped out it's near impossible to catch up 🤣 Either way, here goes another ControlNet-like model, from Tencent.
I've ported T2I-Adapter to be compatible with the diffusers library, go ahead and use it! Example with the Anything v3 model + LoRA + T2I-Adapter. (all with diffusers!)
Ok, my 5.4B freaking-absolute-overkill ImageNet-1K rectified flow model is now finished. This was trained for 320K steps with bs 128, meaning it's SIGNIFICANTLY undertrained. However, it is looking *very good* for its training budget. Also, training was very stable: 0 loss spikes!
Uhh excuse me wtf, LLaMA 3 ranking 1st????? in the lmsys arena in English? Kudos to the team
@AIatMeta
, based AF 👏👏 for open sourcing a literal GPT-4 level model, (almost) no strings attached 🥳
Cannot emphasize this enough, but you only have to train a LoRA once and you can apply it anywhere. The case below is with , which is a pretty awesome model. Configs from
Normal people's hobby: listening to music, sports, video games...
Me: speedrunning pretraining a 5B T2I DiT from scratch in under 3 weeks
RELEASING SOON!!!!! (btw this is the pretrained ver, gotta train on hi-res)
Did you know ImageNet fits in your Apple Watch's RAM?
Introducing imagenet.int8: a 5GB, cropped, VAE'd, quantized version of ImageNet, 26x compression in total, preprocessed in StreamingDataset format.
Enjoy.
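The gist of the quantization, as a hedged sketch (the actual scale constants in imagenet.int8 may differ): SD-VAE latents are roughly bounded, so a linear map into int8 loses very little:

```python
import torch

def quantize(latents: torch.Tensor, scale: float = 8.0) -> torch.Tensor:
    # latents assumed to lie roughly in [-scale, scale]
    return (latents / scale * 127.0).round().clamp(-128, 127).to(torch.int8)

def dequantize(q: torch.Tensor, scale: float = 8.0) -> torch.Tensor:
    return q.float() / 127.0 * scale
```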
At an equal compute budget, using a larger batch almost always implies worse performance.
The rationale for using a larger batch size should always be faster convergence in equal *time*, not better performance at an equal compute budget.
But to be honest, there have been tons of low-rank, quantized gradient-approximation methods for efficient allgathers that the paper didn't mention for some reason. Like, not citing PowerSGD?? Or this? …Like, man, totally not cool 🙄
(fig from the PowerSGD paper)
GaLore
Memory-Efficient LLM Training by Gradient Low-Rank Projection
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank
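For context, the core of PowerSGD (Vogels et al.) fits in a few lines: communicate two skinny matrices instead of the full gradient. This sketch omits the error-feedback buffer and the Q warm-starting that the real algorithm depends on:

```python
import torch

def compress(grad: torch.Tensor, r: int = 4):
    m, n = grad.shape
    q = torch.randn(n, r)            # the real method reuses q from the last step
    p = grad @ q                     # (m, r)
    p, _ = torch.linalg.qr(p)        # orthonormalize the column basis
    q = grad.t() @ p                 # (n, r); all-reduce p and q, not grad
    return p, q

def decompress(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    return p @ q.t()                 # rank-r approximation of the gradient
```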
I managed to get it to work! 2 steps, no progressive distillation, as promised; reasonable quality for a dumb UNet structure and 10 min of training. I think this is the only implementation out there (given it's like 4 days old). Not bad!
The moon is high, the model is 44k steps in. I stopped the run to check on everything and to switch to multi-node; I didn't expect *anything* at all.... However, safe to say, I've trained my FIRST ever 5.6B text2image MMDiT from scratch!!!
Fully fine-tuning SDXL on OW Kiriko images. This took about 10 min. Can you believe this is the fine-tuned Base model? BASE????
@StabilityAI
is simply incredible.
"bUt iT woN't Be aS goOd wiTH yoUR teeNy coMpUte"
nah I don't care, I'm not raising cash bro. Gaining this experience of handling a 100M-scale dataset, pretraining a billion-scale vision model from scratch, post-hoc analysis... *all as a hobby in my free time*, is what matters 😎
Cool work, have a look! Interesting to see they tie the "probability" of the discrete representation to, well, the probability of the dataset: variational inference itself.
So this might be the current best usable form of encoder-based inversion for SD 2.X models. Really good in terms of fidelity, but the NC license is a bit sad.
Google presents Mixture-of-Depths
Dynamically allocating compute in transformer-based language models
Same performance w/ a fraction of the FLOPs per forward pass
Math is,,, incredible. I just fixed the learning rate to be faithful to what muP suggested; now the gradient norm is much more stable, my depression is cured, my eyesight has improved, my posture is better, and cancer is cured.
📢 Introducing MPT: a new family of open-source commercially usable LLMs from
@MosaicML
. Trained on 1T tokens of text+code, MPT models match and - in many ways - surpass LLaMa-7B. This release includes 4 models: MPT-Base, Instruct, Chat, & StoryWriter (🧵)
Unlike ControlNet, T2I-Adapter is lightweight, generalizable out-of-the-box, and very fast. It also doesn't need to generate additional features per-timestep. However, it seems to be less strict than ControlNet, so one might prefer ControlNet for truly fine-grained control.
How did I not know this before? Download models from HF to a local visible directory via:
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=True
huggingface-cli download TheBloke/Yi-34B-Chat-AWQ --local-dir ./yiawq
NO JOKE 100x speedup
First look at training a 0.9B IN1k model: 67k steps in, I'm already getting pretty decent quality images!! minRF is damn scalable with the help of
@MSFTDeepSpeed
!
👉
[ rectified flow, muP, SDXL vae, MMDiT, cfg = 7.0!]
Huh, so it looks like Triton's Flash Attention is significantly faster than torch's integrated SDPA flash attention (which is much faster than naive attention). This was done on a 3070 Ti GPU.
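For anyone who wants to reproduce the comparison, you can pin SDPA to one backend at a time (this uses torch.backends.cuda.sdp_kernel; newer torch versions expose the same thing as torch.nn.attention.sdpa_kernel):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.half)
           for _ in range(3))

# force the flash backend only; flip the flags to time the other backends
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
```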
Cool paper from Google!
An exciting idea to use multiple latents per cross-attention. There might be room for correlated optimization, where some tokens being injected share multiple common embeddings, i.e., inject another common token t_s during optimization.
Recently, Karras demonstrated a post-hoc EMA method, where he was able to "simulate" an arbitrary EMA decay factor after training by saving two copies of the EMA and some clever math.
I took a deep breath to understand it, and wrote a tutorial + working example!
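To give the barest intuition (the tutorial has the real thing): Karras et al. track EMA snapshots under two different decay profiles and solve for combination weights afterwards. Collapsed to its simplest possible form, assuming plain exponential decays and a single mixing weight:

```python
import torch

class TwoEma:
    def __init__(self, params, beta1: float = 0.99, beta2: float = 0.999):
        self.beta1, self.beta2 = beta1, beta2
        self.ema1 = [p.detach().clone() for p in params]
        self.ema2 = [p.detach().clone() for p in params]

    @torch.no_grad()
    def update(self, params):
        for e1, e2, p in zip(self.ema1, self.ema2, params):
            e1.lerp_(p, 1 - self.beta1)   # e1 <- beta1 * e1 + (1 - beta1) * p
            e2.lerp_(p, 1 - self.beta2)

    def posthoc(self, w: float):
        # after training, linearly combine the two profiles to approximate
        # an EMA decay you never actually ran
        return [w * e1 + (1 - w) * e2 for e1, e2 in zip(self.ema1, self.ema2)]
```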
Now that my 5.4B model is stably training (pun intended), the next goal is to deduplicate the wds + filter + recaption.
I've done deduplication multiple times before, but here is my best attempt yet, fully following SD3's approach with SSCD embeddings.
Enjoy!
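The dedup step itself is conceptually simple once you have the embeddings; here is a naive O(N^2) sketch, where `embs` is assumed to be an (N, D) tensor of L2-normalized SSCD-style embeddings and the threshold is a guess:

```python
import torch

def dedup_indices(embs: torch.Tensor, threshold: float = 0.9) -> list[int]:
    keep, kept = [], []
    for i, e in enumerate(embs):
        # cosine similarity against everything we've kept so far
        if kept and (torch.stack(kept) @ e).max() > threshold:
            continue                  # near-duplicate, drop it
        kept.append(e)
        keep.append(i)
    return keep
```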
Since the authors didn't upload the code, here is my attempt at Prompt+! (the results below are from my impl).
Also further tested the "correlated extended embedding" idea, which seems to be working (whether it is better or not is unclear).
Lucky enough to collaborate with
@huggingface
's diffusers team (more like watching them implement 🤣 I wrote no code) and... huge updates! Now LoRA is officially integrated with diffusers! There are major differences from my implementation, and it's very simple to use!
Fine-tune Stable Diffusion in a T4/V100 on a custom image-caption dataset 🧨 🔥 => memory efficiency
This is enabled by LoRA. With LoRA, the fine-tuned checkpoints are just **3 MBs** in size 🤯 => portability
Learn about it 👇
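Usage is a couple of lines (a sketch; the LoRA path is a placeholder, and older diffusers versions used pipe.unet.load_attn_procs instead of load_lora_weights):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/lora")   # the ~3 MB checkpoint
image = pipe("a photo of sks dog").images[0]
```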
Btw, this was done on the int8 quantized dataset I shared a couple weeks ago, which is 26x smaller than the original dataset!!! Imo clever dataset quantization has a lot to offer.
New trick that works insanely well! How would one mitigate the spurious correlations that occur during fine-tuning? Identify the dataset regions of interest! [1/n]
I wouldn't have come up with using LoRA for DreamBooth if I'd had a beefy A100 GPU to play around with 😂 Now even the "GPU-rich" use LoRA to fine-tune diffusion models.
I prefer to operate in “GPU-Poor” mode.
I don't agree with the take from the semianalysis piece. Creative breakthroughs often occur under constraints: new systems, models, and methods that can better take advantage of even larger-scale compute.
Got my hands on it. Super easy to use, and some findings:
1. Works with Textual Inversion, custom models, and LoRA. Incredible flexibility
2. Prompting + guidance has a non-negligible effect here.
3. Sub-second upscaling. Almost a free lunch.
🧨 diffusers 0.17.0 is out and comes with new pipelines, improved LoRA support, `torch.compile()` speedups, and more ⏰
🪄 UniDiffuser
🦄 DiffEdit
⚡️ IF DreamBooth
💡 Support for A1111 LoRA
and more ...
Release notes 📝
1/🧶
Ok, great day for progressive training today:
One for diffusion: train the core t2i component efficiently, freeze it, and train the first / last layers later on
One for LLMs: block expansion for a 50% speedup.
Great stuff!!
Had such a fun time putting this on
@replicatehq
via Cog with
@allnoteson
,
@daannelson
,
@anotherjesse
! Fine-tuning support for all of DreamBooth, Textual Inversion, and LoRA. CLIPSeg masking, BLIP captioning, and SwinIR upscaling preprocessing! + the entire thing is open sourced.
I think these are the first ever openly reproduced muP results at > 1B scale, of
@TheGregYang
and
@edwardjhu
. Following the muP formula, you get to sweep on 100M-scale models and transfer successfully to a 4B model (this sweep took 3 days on 8xA100 GPUs lol)
Just released version 0.0.7! Thanks to all the contributors, now you can use different optimizers for embeddings and LoRAs, benefit from textual inversion directly, inspect LoRAs, enjoy a better module finder, fine-tune MLPs, use safetensors, and use the trainer CLIs!
This project (code not released yet) is awesome, but what in the world is "regularized DDIM inversion"? Is it literally imposing a prior on the latent with a scheduled normal distribution and updating it Bayesian-style during inversion?
🤔 Interesting: I will give $1000 to anyone who finds a task where a larger batch size leads to more compute-efficient optimization, i.e., where a figure like the following is *not* monotonically decreasing.
True and not true. How? It depends on the task. For example, in the case of supervised learning it is true in many cases (still not all), and for contrastive learning a bigger batch is always preferable.
Want a more nuanced take? "Bigger batch size" is very relative, and very much…
Ever dreamed of mixing models "during sampling"? With LoRA, now you can!
If you fine-tune your model too much, it loses information about other stuff, so it loses general composability. Now, you might've wished to apply model A on the first 25 steps and model B on the later 25, as in the toy sketch below. [1/n]
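A toy, self-contained sketch of the idea (every name here is made up: a "model" is just a frozen weight plus a merged low-rank delta, and the "denoising step" is a stand-in):

```python
import torch

W0 = torch.randn(64, 64)                                    # frozen base weight
delta_A = torch.randn(64, 4) @ torch.randn(4, 64) * 0.01    # merged LoRA A
delta_B = torch.randn(64, 4) @ torch.randn(4, 64) * 0.01    # merged LoRA B

x = torch.randn(1, 64)                                      # stand-in latent
for step in range(50):
    W = W0 + (delta_A if step < 25 else delta_B)            # model A first, then B
    x = x - 0.01 * (x @ W.t())                              # stand-in denoising step
```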
We need a better dataloader for pytorch, one that is in a sense a mix of the MDS of
@DbrxMosaicAI
, WebDataset, and SQL.
We should be able to join data columns. We should be able to filter (some sort of query language on the fly), in an efficient, distributed manner...
Effective free lunch I made today! Karras EMAing once every K steps and adjusting beta accordingly is a free lunch. (+ when you do it in a CPU-offloaded fashion, this is effectively zero-cost EMA!)
Code ->
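The linked code has the real thing; as a rough sketch of the adjustment (my formulation: to emulate a per-step decay beta while only updating every K steps, use beta**K, and keep the shadow weights on CPU so the GPU barely notices):

```python
import torch

class LazyEma:
    def __init__(self, params, beta: float = 0.9999, k: int = 10):
        self.beta_k, self.k, self.step = beta ** k, k, 0    # adjusted decay
        self.shadow = [p.detach().to("cpu", copy=True) for p in params]

    @torch.no_grad()
    def update(self, params):
        self.step += 1
        if self.step % self.k:                # only touch the EMA every k steps
            return
        for s, p in zip(self.shadow, params):
            s.lerp_(p.detach().cpu(), 1 - self.beta_k)
```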
There is no such thing as a silver bullet, and it all depends on the downstream domain. However, in many cases "nicely done" fine-tuning ALWAYS performs better than in-context learning. It's actually typically the case that LMs can be fine-tuned to perform better at ICL (MetaICL).
Extremely hot LLM take: you will often get better results with few-shot prompting (with good examples) on a modern LLM than with a fine-tuned LLM.
Fine-tuning was the best option for weaker LLMs with smaller context windows: both problems have been solved nowadays.
This is the only legitimate use case of abstract algebra within deep learning research I've ever seen. God damn, it's so cool.... (you know, not being one of those papers that use high-level math just for the sake of it 🙄)
Ok, I cannot believe this, but this actually worked: given the same number of steps, skipping 50% of the initial inversion steps (so that the 0.5 < t < 1.0 steps are finer) helps inversion significantly... Check the code out if interested.
If the misalignment between x_t and x_{t+1} is large at the beginning (x_T), why don't we use smaller DDIM steps at the later stage of DDIM inversion? i.e., reparametrize the scheduler to be finer close to t ~ 1?
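In schedule terms, the trick is just this (a sketch of my reading: spend the whole step budget on the late region instead of spreading it over [0, 1]):

```python
import torch

def inversion_times(n_steps: int, t_min: float = 0.5) -> torch.Tensor:
    # all n_steps land in (t_min, 1.0], so steps near t ~ 1 are finer
    return torch.linspace(t_min, 1.0, n_steps)   # vs. torch.linspace(0.0, 1.0, n_steps)
```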
The importance of vision-language models, especially CLIP 📎, is ever-growing. I know that a lot of my followers are hardcore ML engineers/researchers interested in multimodal training, so here is a set of very recent literature on faster, better-performing training of CLIPs. 🧵 [1/n]
"Bounded gaps between primes" by Yitang Zhang is arguably the most important paper towards solving the twin prime conjecture... and it has about 700 citations.
Meanwhile, a random LLM paper that gives you one-liner prompt-engineering pro tips boosting +3 points on MMLU gets 1k+ citations.
Just got the results!!! MMDiT 🤝muP.
infinite width never disappoints
🫡
@TheGregYang
Gradient norm: never blows up,
Loss: never spikes, at any scale!
Feature updates: Maximal🌊🌊
The code to reproduce this ->
@Birchlabs
@StefanABaumann
@SeunghyunSEO7
@imbue_ai
Ok, this is *ONLY* the beginning. While I was broadcasting this progress on twitter, the
@FAL
guys reached out to me to plan on making this more powerful, and to go on and build > 8B models from scratch, using better methods, better-captioned datasets, everything! All open-sourced!
Every modern large-scale ML practitioner should read the following three papers imo:
1. Scaling
2. Scaling batch size
3. Scaling in transfer
Oh! They just happen to all be from OpenAI 🤔 no wonder 🤷♂️
I can't be the only one to have missed this, but let me speculate that GPT-3.5 / GPT-4 were trained with PowerSGD.
Why? Because they turned to PowerSGD in the DALL·E 1 paper, and unless it really turned out to be good at that scale without many compromises, they simply wouldn't have done it.
Personal update: I am
delaying my master's in robotics and will be joining Naver (it's the largest tech company in Korea) for the next three months to do research on RLHF and build CLOVA X (a Korean LLM).
I will continue to build open source stuff on t2i and do side projects!
Here is another project I worked on for the past 3 weeks: Language reroll + FIM-LLAMA, a LLaMA with fill-in-the-middle capability + a sleek interface; you can use vLLM for context-aware document inpainting.
So SDXL works great with prompt weighting!
You have two different text encoders: prompt-weight them separately, concatenate the embeddings, and sample, as in the toy sketch below.
(values ranging from 0.8 to -0.8)
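A toy sketch of the mechanics (shapes mimic SDXL's CLIP-L + OpenCLIP-G encoders, but the tensors and the (1 + w) weighting convention here are stand-ins, not the actual pipeline code):

```python
import torch

emb_l = torch.randn(1, 77, 768)     # stand-in for text-encoder-1 token embeddings
emb_g = torch.randn(1, 77, 1280)    # stand-in for text-encoder-2 token embeddings
w = torch.zeros(1, 77, 1)           # per-token weight offsets, e.g. in [-0.8, 0.8]
w[:, 5] = 0.8                       # emphasize token 5

# weight each encoder's output separately, then concatenate along channels
weighted = torch.cat([emb_l * (1 + w), emb_g * (1 + w)], dim=-1)   # (1, 77, 2048)
```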
bros... just turn on `split_by_worker` and ShardList to max out IO.
It will not get faster otherwise: not prefetch_factor, not num_workers, nothing.
idk why this worked, YOU WILL THANK ME LATER. I wasted 3 hours so I'm just sharing. Tell me more if you know why
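In code, the tip looks like this (the shard pattern is a placeholder; `split_by_worker` is a real webdataset helper that gives each dataloader worker its own distinct .tar shards):

```python
import webdataset as wds

dataset = wds.DataPipeline(
    wds.SimpleShardList("shards/data-{000000..000999}.tar"),
    wds.split_by_worker,           # each worker streams different shards => IO maxed
    wds.tarfile_to_samples(),
    wds.decode("pil"),
)
```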
Here is a small project I bashed out this weekend: "ezmup"
muP is an effective weight-init scheme everyone should use. With ezmup, 3 LOC is all you need + it's model-agnostic! 🔨
mup = Ezmup(width, model)
mup.change_width_as(64)
But... What *is* muP ? [1/n]
A Large-Scale Exploration of μ-Transfer
Investigates µP empirically, which works as intended for the majority of important cases, from 2M to 10B parameters, with some outliers
3 days in, I see the gradient norm sloooowly increasing. I found this to be the case with OLMo's training as well. Seriously, what's a good framework to explain this? Is this edge-of-stability happening IRL? 🤔🤔
We are already having lots of fun exploring fine-tuning of
#sdxl
on
@replicatehq
This WIP by
@cloneofsimo
really captures the essence of
@zeke
at work
Looking forward to enabling both fine-tuning and LoRA
Scaling up GANs for Text-to-Image Synthesis
We present our 1B-parameter GigaGAN, achieving lower FID than Stable Diffusion v1.5, DALL·E 2, and Parti-750M. It generates 512px outputs at 0.13s, orders of magnitude faster than diffusion and autoregressive …
Another example: the Mo-di model from
@Nitrosocke
, distilled.
prompt: "modern disney style, cute baby lion"
The updated distillation will be available in v0.1.2!
The true beauty of LoRA is that you can train one with v1.5 and apply it to any SD model (of course, given they are similar enough). I applied a Wednesday LoRA to
@Nitrosocke
's Redshift Diffusion to get the following:
So... Wednesday smiles?