Sander Dieleman Profile
Sander Dieleman

@sedielem

50,525 Followers
1,541 Following
74 Media
1,848 Statuses

Research Scientist at Google DeepMind. I tweet about deep learning (research + software), music, generative models (personal account).

London, England
Joined December 2014
Pinned Tweet
@sedielem
Sander Dieleman
2 months
New blog post! Some thoughts about diffusion distillation. Actually, quite a lot of thoughts 🤭 Please share your thoughts as well!
9
82
430
@sedielem
Sander Dieleman
2 years
A very common trick in neural net training, often omitted in papers: add a tiny number ε (e.g. 1e-10) to any quantity in a denominator or square root, so you don't divide by 0. My advice: always add ε! If it doesn't help, it won't hurt, and you might avoid a few NaN encounters👀
Tweet media one
30
187
2K
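A minimal sketch of the trick in the tweet above, assuming a NumPy setting; the constant 1e-10 and the helper function are illustrative, not taken from the screenshot.

```python
import numpy as np

EPS = 1e-10  # tiny constant; the exact value is a judgment call

def cosine_similarity(a, b, eps=EPS):
    # Adding eps to the denominator means an all-zero vector
    # yields a similarity of 0 instead of a NaN.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

print(cosine_similarity(np.zeros(3), np.ones(3)))  # 0.0, no NaN
```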
@sedielem
Sander Dieleman
1 year
Me: "NOOO, you can't just treat spectrograms as images, the frequency and time axes have completely different semantics, there is no locality in frequency and ..." These guys: "Stable diffusion go brrr"
@_akhaliq
AK
1 year
Riffusion, real-time music generation with stable diffusion @huggingface model: project page:
Tweet media one
64
628
3K
18
144
1K
@sedielem
Sander Dieleman
10 months
New blog post: perspectives on diffusion, or how diffusion models are autoencoders, deep latent variable models, score function predictors, reverse SDE solvers, flow-based models, RNNs, and autoregressive models, all at once!
16
203
885
@sedielem
Sander Dieleman
1 year
New blog post about diffusion language models: Diffusion models have completely taken over generative modelling of perceptual signals -- why is autoregression still the name of the game for language modelling? And can we do anything about that?
25
173
869
@sedielem
Sander Dieleman
6 years
Stacking WaveNet autoencoders on top of each other leads to raw audio models that can capture long-range structure in music. Check out our new paper: Listen to some minute-long piano music samples:
Tweet media one
Tweet media two
5
246
782
@sedielem
Sander Dieleman
1 year
First Riffusion, now this. Perhaps pixels are all you need🤔
@_akhaliq
AK
1 year
Image-and-Language Understanding from Pixels Only abs:
Tweet media one
12
223
873
14
87
607
@sedielem
Sander Dieleman
2 years
This paper is a goldmine for anyone training diffusion models, carefully picking apart theory and practice and showing which choices really matter. I was quite excited to see the authors of the StyleGAN series of papers tackle this topic, and boy do they deliver!
Tweet media one
@_akhaliq
AK
2 years
Elucidating the Design Space of Diffusion-Based Generative Models abs: improve efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55
Tweet media one
0
43
272
2
110
607
@sedielem
Sander Dieleman
3 months
This one's easy! That honour goes to "the diffusion bible", as I like to call it. It's been well over a year and I still refer to it several times a week. Very few papers I've read come close, in terms of signal-to-noise ratio.
Tweet media one
@sp_monte_carlo
Sam Power
3 months
what paper (not your own, maybe not even in your own area) can you not stop telling people about?
88
44
450
8
66
549
@sedielem
Sander Dieleman
6 years
"We conclude that the common association between sequence modeling and recurrent nets should be reconsidered, and convolutional nets should be regarded as a natural starting point for sequence modeling tasks." Great to see more work in this direction!
Tweet media one
8
197
528
@sedielem
Sander Dieleman
4 years
Very excited about the renewed focus on iterative refinement as a powerful tool for generative modelling! Here are a few relevant ICLR 2021 submissions: (image credit: ) (1/3)
5
108
520
@sedielem
Sander Dieleman
6 months
5-6 years ago I was working on music generation at DeepMind, but let me tell you, this is... something else. Incredibly excited to be able to finally share what our team has been working on!
@demishassabis
Demis Hassabis
6 months
Thrilled to share #Lyria , the world's most sophisticated AI music generation system. From just a text prompt Lyria produces compelling music & vocals. Also: building new Music AI tools for artists to amplify creativity in partnership w/YT & music industry
95
538
3K
18
38
495
@sedielem
Sander Dieleman
2 years
New blog post about the magic of diffusion guidance! Guidance powers the recent spectacular results in text-conditioned image generation (DALL·E 2, Imagen), so the time is right for a closer look at this simple, yet extremely effective technique.
10
97
452
@sedielem
Sander Dieleman
7 years
Harmonic networks ( @deworrall92 et al.) are fully rotation equivariant convnets. Very cool!
Tweet media one
Tweet media two
Tweet media three
4
167
395
@sedielem
Sander Dieleman
3 years
To synthesise realistic megapixel images, learn a high-level discrete representation with a conditional GAN, then train a transformer on top. Beautiful synergy between adversarial and likelihood-based learning! 🧵 (1/8)
@_akhaliq
AK
3 years
Taming Transformers for High-Resolution Image Synthesis pdf: abs: project page:
Tweet media one
7
101
490
4
84
389
@sedielem
Sander Dieleman
9 months
New blog post about the geometry of diffusion guidance: This complements my previous blog post on the topic of guidance, but it has a lot of diagrams which I was too lazy to draw back then! Guest-starring Bundle, the cutest bunny in ML 🐇
9
77
355
@sedielem
Sander Dieleman
1 year
New paper: continuous diffusion for categorical data. We train diffusion language models with cross-entropy, using score interpolation instead of score matching. The training distribution of noise levels is adapted on the fly with time warping. (1/3)
Tweet media one
Tweet media two
5
74
342
@sedielem
Sander Dieleman
5 years
Likelihood is a great loss fn, it's all about the space you measure it in! Our latest work on hierarchical AR image models (w/ @JeffreyDeFauw , Karen Simonyan): We generated 128x128 & 256x256 samples for all ImageNet classes: (1/2)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
99
336
@sedielem
Sander Dieleman
4 years
New blog post: 'Generating music in the waveform domain' A comprehensive overview of the field and some personal thoughts, based on a tutorial I gave at @ismir2019 with @jordiponsdotme and Jongpil Lee back in November. Comments / feedback welcome!
Tweet media one
8
113
296
@sedielem
Sander Dieleman
5 years
I will be at #NeurIPS2018 to present our work on music generation in the raw audio domain, using a stack of WaveNet autoencoders. Poster #87 on Tuesday Dec 4th, 5PM-7PM! Paper: Samples:
Tweet media one
2
76
296
@sedielem
Sander Dieleman
7 years
I've been working on WaveNet autoencoders with @GoogleBrain Magenta. blog post: paper:
Tweet media one
6
99
285
@sedielem
Sander Dieleman
1 year
End of year shower thought: Before AlexNet, we used layer-wise pre-training to train neural nets with >2 layers -- backprop just couldn't hack it. Diffusion and autoregression are the new layer-wise pre-training: decompose generation into many steps, train one step at a time!
9
19
243
@sedielem
Sander Dieleman
1 year
Batch normalisation appears to be falling out of favour (probably for the best IMO, so many bugs end up being batchnorm bugs😬). One area where it persists is GAN discriminators (e.g. in StyleGAN-T and VQGAN). Are there any other settings where batchnorm is still hard to avoid?
19
19
240
@sedielem
Sander Dieleman
3 years
🆕Variable-rate discrete representation learning🆕 We learn slowly-varying discrete representations of speech signals, compress them with run-length encoding, and train transformers to model language in the speech domain 🗣️ 📜 🔊
Tweet media one
Tweet media two
1
54
230
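The compression step mentioned in the tweet above is ordinary run-length encoding; a minimal sketch on a toy token sequence (the (symbol, length) pair format is illustrative, not the paper's exact scheme):

```python
def run_length_encode(tokens):
    # Collapse runs of repeated discrete symbols into (symbol, run_length) pairs.
    encoded = []
    for t in tokens:
        if encoded and encoded[-1][0] == t:
            encoded[-1] = (t, encoded[-1][1] + 1)
        else:
            encoded.append((t, 1))
    return encoded

print(run_length_encode([7, 7, 7, 3, 3, 9]))  # [(7, 3), (3, 2), (9, 1)]
```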
@sedielem
Sander Dieleman
5 months
With all the recent work on distilling diffusion models into single-pass models, I've been thinking a lot about diffusion model training as solving a kind of optimal transport problem🚐 (1/6)
3
25
228
@sedielem
Sander Dieleman
5 months
Parameterising neural nets to predict logits and training them using the cross-entropy loss function is an extremely effective combination. This setup works for diffusion models as well, by using score interpolation instead of score matching! See (§3.1)
@hi_tysam
Fern
5 months
The more I work in ML the more I feel like nearly any loss objective can, and should, be rephrased as its cross-entropy-based analog.
6
6
75
3
19
216
@sedielem
Sander Dieleman
1 month
10 years ago to the day, I published my first ML-related blog post: My blogging has been very sporadic over the years, but sharing what I've learnt has been very rewarding, and probably a pretty good career move as well😁 I highly recommend it!
3
15
216
@sedielem
Sander Dieleman
1 year
Two neat papers about diffusion for high-res images without cascading. Similar observations:
- tuning the noise schedule is really important
- the bulk of computation can be done on a significantly more compact representation
Tweet media one
Tweet media two
2
28
214
@sedielem
Sander Dieleman
4 years
WaveGrad generates waveforms from spectrograms by iteratively following the log-likelihood gradient. The surprising thing is that it needs as little as 6 steps to produce good quality audio! Seems like the resurgence of score matching is in full swing :)
Tweet media one
@heiga_zen
Heiga Zen (全 炳河)
4 years
Yet another neural vocoder from my team mates in Google Brain is out! The new model, "WaveGrad", is not autoregressive/Flow/GAN. It is based on score matching / diffusion probabilistic models. Check it please!!
2
62
314
0
42
211
@sedielem
Sander Dieleman
7 years
Lots of interesting work on "fixing" GANs right now: [1/3]
Tweet media one
Tweet media two
Tweet media three
3
82
208
@sedielem
Sander Dieleman
2 years
@A_K_Nain I think this is exacerbated by the fact that there are multiple formalisms (e.g. VAE-style, score-based, SDE, ...) and everything has 2-3 different names, depending on who you ask! I strongly recommend @YSongStanford 's compendium (with Python notebooks!):
2
32
201
@sedielem
Sander Dieleman
8 months
This work shows scalar quantisation is competitive with VQ across a range of tasks, but simplifies things a lot: no codebook collapse, no EMA updates, ... because no codebook! I've been a fan of scalar quantisation for a while, see
@_akhaliq
AK
8 months
Finite Scalar Quantization: VQ-VAE Made Simple paper page: propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a…
Tweet media one
4
49
243
6
31
190
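A rough sketch of the scalar-quantisation idea referenced above; the tanh bounding and num_levels value are assumptions, and the actual FSQ method additionally uses a straight-through estimator and per-channel level counts.

```python
import numpy as np

def scalar_quantize(z, num_levels=7):
    # Squash each latent channel into a bounded range, then round to the
    # nearest integer, giving num_levels distinct values per channel (for
    # odd num_levels). There is no codebook to collapse and nothing to
    # update with EMA: the "codes" are just the fixed integer grid.
    bounded = np.tanh(z) * (num_levels - 1) / 2.0
    return np.round(bounded)
```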
@sedielem
Sander Dieleman
4 years
Good advice! For classification models, a scatter plot of the cross-entropy loss vs. prediction entropy (~confidence) for individual examples can be very revealing. More generally: study model behaviour for individual data points, don't look at aggregate statistics exclusively.
@karpathy
Andrej Karpathy
4 years
When you sort your dataset descending by loss you are guaranteed to find something unexpected, strange and helpful.
30
227
2K
1
36
189
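A sketch of the per-example diagnostic described in the tweet above, assuming NumPy arrays of logits and integer labels (the function name is illustrative):

```python
import numpy as np

def per_example_diagnostics(logits, labels):
    # logits: (N, C), labels: (N,) integer class indices.
    logits = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)
    loss = -log_probs[np.arange(len(labels)), labels]              # per-example cross-entropy
    entropy = -(probs * log_probs).sum(axis=1)                     # prediction entropy (~confidence)
    return loss, entropy

# loss, entropy = per_example_diagnostics(logits, labels)
# plt.scatter(entropy, loss, s=2)  # one point per example, not an aggregate statistic
```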
@sedielem
Sander Dieleman
6 years
Invertible neural networks are really cool! Check out this excellent blog post about a new paper where they are used to analyse inverse problems: paper: (1/4)
Tweet media one
Tweet media two
Tweet media three
6
59
186
@sedielem
Sander Dieleman
8 years
I've uploaded my PhD thesis "Learning feature hierarchies for musical audio signals", which I defended in January:
5
48
182
@sedielem
Sander Dieleman
6 months
At the end of the summer, I gave an invited talk at the @M2lSchool in Thessaloniki about training neural networks. It's a bit of a jumble of ideas, suggestions and best practices I've amassed over the years, interspersed with concrete examples.
4
28
183
@sedielem
Sander Dieleman
8 years
tl;dr: connect every CNN layer to every other layer. Simple but effective idea, well-written paper. Worth a read!
@brandondamos
Brandon Amos ✈️ ICLR
8 years
Densely Connected Convolutional Networks
Tweet media one
Tweet media two
2
66
95
3
63
178
@sedielem
Sander Dieleman
1 year
If diffusion model sampling tries your patience, check out consistency models: single-step sampling! No adversarial loss! In addition to being a very cool idea, this paper significantly leans on the formalism from Karras et al. 2022 AKA my favourite diffusion paper😁 Neat!
@_akhaliq
AK
1 year
Consistency Models achieve the new state-of-the-art FID of 3.55 on CIFAR10 and 6.20 on ImageNet 64×64 for one-step generation abs:
Tweet media one
8
98
437
0
33
175
@sedielem
Sander Dieleman
4 years
Neat idea: if you fit augmentation params with gradient descent (jointly with model params) using a prior that gently encourages more augmentation, they will naturally drift towards the maximal sensible values, which correspond to the degree of invariance exhibited by the data.
@andrewgwils
Andrew Gordon Wilson
4 years
Translation equivariance has imbued CNNs with powerful generalization abilities. Our #NeurIPS2020 paper shows how to *learn* symmetries -- rotations, translations, scalings, shears -- from training data alone! w/ @g_benton_ , @Pavel_Izmailov , @m_finzi . 1/9
6
90
409
0
31
175
@sedielem
Sander Dieleman
4 years
New blog post, in which I wax lyrical about typicality and the curse of dimensionality: I tweeted about this concept a while back, but it turns out I have more to say on the topic. It's a bit more speculative than what I usually write, hope you like it!
3
48
165
@sedielem
Sander Dieleman
3 months
I've got a blog post brewing... maybe even two blog posts! They are about diffusion models🙃
4
5
161
@sedielem
Sander Dieleman
5 months
10 years ago today: @avdnoord and I presenting our audio-based music recommendation demo at @NeurIPSConf 2013! We went on to intern at Spotify & Google Play Music the next summer (blog post: ), and by summer 2015, we had both joined @GoogleDeepMind .
Tweet media one
5
4
162
@sedielem
Sander Dieleman
7 years
The TF wrapper we use internally at DeepMind has been open sourced. Lasagne users might like this one, it shares a lot of design principles.
@GoogleDeepMind
Google DeepMind
7 years
Excited to release #Sonnet - a library for constructing complex Neural Network models in TensorFlow. Get started:
Tweet media one
4
473
714
1
79
160
@sedielem
Sander Dieleman
7 years
Google Assistant is now powered by WaveNet!
3
41
158
@sedielem
Sander Dieleman
4 years
Our latest work on GANs for text-to-speech, from characters/phonemes to waveforms with a single model. Learning varying alignment without teacher forcing is tricky, but we found dynamic time warping (DTW) to be very effective.
@GoogleDeepMind
Google DeepMind
4 years
In our new paper [] we propose EATS: End-to-End Adversarial Text-to-Speech, which allows for speech synthesis directly from text or phonemes without the need for multi-stage training pipelines or additional supervision. Audio:
Tweet media one
8
200
706
2
36
154
@sedielem
Sander Dieleman
4 years
A concept that really helped me to understand the behaviour of likelihood-based sequence models is "typicality": It was originally defined in an information-theoretic context, but it is equally relevant in machine learning. (1/4)
1
44
152
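A quick numerical illustration of the concept in the simplest possible setting, a high-dimensional standard Gaussian: samples concentrate on a thin shell far away from the mode.

```python
import numpy as np

d = 10_000                                   # dimensionality
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, d))           # 1000 samples from N(0, I_d)
norms = np.linalg.norm(x, axis=1)

print(norms.mean(), np.sqrt(d))              # ~100.0 vs 100.0: the typical radius
print(norms.min())                           # nowhere near 0, even though 0 is the density's peak
```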
@sedielem
Sander Dieleman
2 years
Diffusion models work by learning to invert a process that gradually destroys information, step-by-step. Adding Gaussian noise is only one way to construct such a process, here's another: running the heat equation across the spatial dimensions of the image gradually blurs it.
@arnosolin
Arno Solin
2 years
🔥 'Generative modelling with inverse heat dissipation' 🔥 \w Severi and @HeinonenMarkus . A model that learns to generate images by inverting a PDE that effectively 'blurs' an image and comes with appealing properties. 📄 🎬 [1/6]
12
87
451
3
16
154
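A minimal sketch of the degradation process described above: one explicit finite-difference step of the 2D heat equation, applied repeatedly to blur an image. The step size and the periodic boundary handling are assumptions, not the paper's exact setup.

```python
import numpy as np

def heat_step(img, alpha=0.2):
    # du/dt = laplacian(u), discretised with a 5-point stencil and periodic
    # boundaries. Repeated application gradually blurs the image, destroying
    # high-frequency detail first.
    lap = (np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0) +
           np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1) - 4.0 * img)
    return img + alpha * lap

def dissipate(img, steps):
    for _ in range(steps):
        img = heat_step(img)
    return img
```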
@sedielem
Sander Dieleman
1 year
@BlackHC This is all you need to know
@videodrome
Robbie Barrat
6 years
I'm laughing so hard at this slide a friend sent me from one of Geoff Hinton's courses; "To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it."
Tweet media one
25
760
2K
3
4
146
@sedielem
Sander Dieleman
2 years
In addition to being autoencoders, diffusion models are also RNNs. I quite like the perspective of diffusion models as simply a way to train very deep generative nets with something that scales better than backpropagation. Oh and BTW diffusion models are also normalising flows🙃
@jxmnop
jack morris
2 years
Diffusion is just an easy-to-optimize way to give neural networks adaptive computation time. Makes sense then that diffusion models beat GANs, which only get one forward pass to generate an image. have to wonder what other ways there are to integrate for loops into NNs...
24
49
604
3
13
142
@sedielem
Sander Dieleman
10 months
We will be hosting the Machine Learning for Audio workshop🔊🎶 at #NeurIPS2023 in New Orleans in December! Submission deadline: September 29. Cool things are happening in this space🚀so join us if you can and spread the word! Speakers, schedule, etc.:
1
35
140
@sedielem
Sander Dieleman
4 years
Some thoughts about "Scaling Laws for Autoregressive Generative Modeling" by Henighan et al. (). It's a lot to take in, but highly recommended reading! (1/5)
Tweet media one
1
17
136
@sedielem
Sander Dieleman
5 months
I read the adversarial diffusion distillation paper recently ( it's neat, check it out!), and realised it's probably the first paper in many months that I've actually read all the way through! What should I be reading on the way to #NeurIPS2023 ? ✈️
5
13
138
@sedielem
Sander Dieleman
5 years
I will be at #ICLR2019 this week, find me if you want to talk about generative models and/or ML for audio/music 🎵. Also make sure to check out the poster and talk for MAESTRO on Tuesday at 10AM!
Tweet media one
1
33
137
@sedielem
Sander Dieleman
2 years
When working on WaveNet, we noticed there is a "critical model size" at which point it suddenly starts working well -- smaller models basically don't work at all. In retrospect, I suppose this is another instance of "sudden emergence". This probably applies to all AR models.
@AlexTamkin
Alex Tamkin 🦣
2 years
Why do certain capabilities seem to suddenly emerge in LLMs? One possibility: Even if your probability of predicting the next token correctly goes up gradually (x-axis), Your probability of getting a *multi-token* output correct can shoot up really quickly (y=x^k)
3
12
150
4
14
136
@sedielem
Sander Dieleman
7 years
PatternNet & PatternLRP: nice work from a former colleague on interpreting neural network classification decisions.
Tweet media one
Tweet media two
0
46
132
@sedielem
Sander Dieleman
3 years
JAX's clean, compact APIs and its powerful function transformations (𝚟𝚖𝚊𝚙 ALL the things!), combined with DeepMind's adoption of "incremental buy-in" as a philosophy underpinning our software infrastructure, have had a huge positive impact on my work.
@GoogleDeepMind
Google DeepMind
3 years
In a new blog post, @davidmbudden and @matteohessel discuss how JAX has helped accelerate our mission, and describe an ecosystem of open source libraries that have been developed to make JAX even better for machine learning researchers everywhere:
Tweet media one
4
118
567
1
16
132
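A tiny example of the "vmap ALL the things" style alluded to above; the linear model and names are illustrative, not DeepMind's internal code.

```python
import jax
import jax.numpy as jnp

def example_loss(params, x, y):
    # squared error of a linear model, written for a *single* example
    return (jnp.dot(params, x) - y) ** 2

# vmap turns the per-example function into a batched one,
# and the transformations compose freely with grad and jit:
batched_loss = jax.vmap(example_loss, in_axes=(None, 0, 0))
mean_loss = lambda params, xs, ys: jnp.mean(batched_loss(params, xs, ys))
grad_fn = jax.jit(jax.grad(mean_loss))
```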
@sedielem
Sander Dieleman
5 months
I've been fascinated by Aapo's work on score matching since back when I was doing my PhD. Of course back then, the best application we could think of was training restricted Boltzmann machines🙃 I always had a feeling we would see score matching resurface at some point!
@volokuleshov
Volodymyr Kuleshov 🇺🇦
5 months
It's crazy how many modern generative models are 15-year old Aapo Hyvarinen papers. Noise contrastive estimation => GANs Score matching => diffusion Ratio matching => discrete diffusion If I were a student today, I'd carefully read Aapo's papers, they’re a gold mine of ideas.
Tweet media one
10
117
1K
2
16
133
@sedielem
Sander Dieleman
5 years
More progress in flow-based models! tl;dr: use masking as in autoregressive flows to get triangular Jacobians (so they can be computed analytically, avoiding power series), but use fixed point iteration for fast inversion as in i-ResNet / residual flows.
@DrYangSong
Yang Song
5 years
Releasing our paper on MintNet! It's a new flow model built by replacing normal convolutions in ResNets with masked convolutions. It has exact likelihood, fast sampling with fixed-point iteration, and better performance than published results on MNIST, CIFAR-10 and small ImageNet
Tweet media one
Tweet media two
Tweet media three
Tweet media four
4
58
259
2
25
130
@sedielem
Sander Dieleman
3 years
Unsupervised speech recognition🤯 a conditional GAN learns to map pre-trained and segmented speech audio features to phoneme label sequences. It is trained only to produce realistic looking words and sentences -- no need for any labeled data. Amazed at how well this works!
@MichaelAuli
Michael Auli
3 years
Today we are announcing our work on building speech recognition models without any labeled data! wav2vec-U rivals some of the best supervised systems from only two years ago. Paper: Blog: Code:
15
314
1K
1
25
127
@sedielem
Sander Dieleman
3 years
This work has the promise of freeing us from the technical debt that normalisation layers bring with them. I've often wondered if the gains from BatchNorm are offset by the myriad of bugs it has the potential to introduce 🤔 Now we can get competitive results without it!
@ajmooch
Andy Brock
3 years
Normalizer-Free ResNets: Our ICLR2021 paper w/ @sohamde_ & @SamuelMLSmith We show how to train deep ResNets w/o *any* normalization to ImageNet test accuracies competitive with ResNets, and EfficientNets at a range of FLOP budgets, while training faster.
Tweet media one
Tweet media two
8
87
409
2
30
128
@sedielem
Sander Dieleman
2 months
The way overfitting is usually taught: you underfit for a while, then at some point, you start overfitting. This "phase transition" perspective can be misleading. As Alex points out, you can have both at the same time. It's probably more useful to think of it as a trade-off.
Tweet media one
@unixpickle
Alex Nichol
2 months
I'm surprised how few people realize it's possible to underfit and overfit at the same time.
2
2
60
5
7
127
@sedielem
Sander Dieleman
6 months
Fun thread about the magic of diffusion🙂 ...though I can't resist pointing out that this glosses over an important fact: 99% of bits in images are not perceptually relevant, diffusion is good at modelling the 1% that matter. Blog post with more details:
@quantian1
Quantian1
6 months
The fact that it is actually perfectly general, is nothing short of astounding. It’s like watching a street magician pull a handkerchief from his nose, and then for his next trick he astral projects you to the realm of Platonic forms.
5
42
846
1
21
127
@sedielem
Sander Dieleman
10 months
It's been a while, so I thought I'd write a quick blog post about some different perspectives on diffusion models this weekend, but it's already grown to 10 sections and shows no signs of abating. Short-form blogging just isn't my style, I suppose 🙃 Coming soon...ish!
5
4
119
@sedielem
Sander Dieleman
5 months
At the latent diffusion tutorial panel yesterday, I briefly mentioned the difficulties of training autoencoders on language data. Today at the poster session, I found this paper. Looks like they've figured out a way to make this work! (§4.1) #NeurIPS2023
1
9
116
@sedielem
Sander Dieleman
5 months
If you're at @NeurIPSConf , come check out our demo at the @GoogleDeepMind booth on Wednesday at noon, we've got some cool stuff to share! 🎶 #NeurIPS2023
Tweet media one
1
11
113
@sedielem
Sander Dieleman
8 years
Our paper about exploiting cyclic symmetry in convnets was accepted at ICML!
Tweet media one
1
42
111
@sedielem
Sander Dieleman
2 years
Amazing audio generation results using a two-level approach: a semantic (low-rate) and an acoustic (higher-rate) representation, learnt separately, are combined to hierarchically generate waveforms with long-range coherence. Very impressive speech and piano continuations!
@_akhaliq
AK
2 years
AudioLM: a Language Modeling Approach to Audio Generation abs: project page:
Tweet media one
4
72
372
2
16
109
@sedielem
Sander Dieleman
3 years
Flow-based models are usually less expressive because the Jacobian needs to be easily invertible. Keller et al. train both forward and inverse models, matching them using cycle-consistency. Then the Jacobian of one model can be used in the loss of the other, no inversion needed!
@t_andy_keller
Andy Keller
3 years
Excited to share my first paper! Self Normalizing Flows -- An efficient training method for unconstrained normalizing flows. Joint work w/ the ever supportive @jornpeters , @priyankjaini , @emiel_hoogeboom , Patrick Forré & @wellingmax 1/5
Tweet media one
10
54
428
2
20
109
@sedielem
Sander Dieleman
2 years
Neat idea: to apply diffusion models to discrete data, map the discrete symbols to binary patterns. This paper also contains a few tricks that have the potential to improve diffusion models across the board, most notably "self-conditioning". Worth a read!
@_akhaliq
AK
2 years
Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning abs: first represent the discrete data as binary bits, and then train a continuous diffusion model to model these bits as real numbers called analog bits
Tweet media one
3
33
199
1
13
110
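A sketch of the discrete-to-continuous mapping described above, assuming NumPy; the bit ordering and the {-1, +1} scaling are illustrative choices, not necessarily the paper's.

```python
import numpy as np

def to_analog_bits(tokens, num_bits):
    # Write each integer symbol as num_bits binary digits,
    # then treat them as real numbers in {-1.0, +1.0}.
    bits = (tokens[..., None] >> np.arange(num_bits)) & 1
    return bits.astype(np.float32) * 2.0 - 1.0

def from_analog_bits(analog):
    # Decode generated samples by thresholding each bit at zero.
    bits = (analog > 0).astype(np.int64)
    return (bits << np.arange(bits.shape[-1])).sum(axis=-1)

tokens = np.array([0, 5, 255])
assert np.all(from_analog_bits(to_analog_bits(tokens, 8)) == tokens)
```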
@sedielem
Sander Dieleman
2 years
"Negative results" appendices are great and I wish they were more common! Especially in empirically driven ML research, where the devil is so often in the details. That said: just because someone says something didn't pan out, doesn't necessarily mean you can't make it work🙃
@surajkothawade
suraj kothawade
2 years
I wish research papers had a section in the appendix titled "What did not work". Although the main paper should outline "what works", it's worth writing about the series of failed experiments.
41
229
2K
3
5
104
@sedielem
Sander Dieleman
4 years
Neat trick for faster (parallel) sampling from autoregressive models: treat it as solving a triangular system of nonlinear equations and use fixed point iteration, instead of sampling step by step.
@DrYangSong
Yang Song
4 years
Excited to share our paper on accelerating feedforward computations in ML — such as evaluating a DenseNet or sampling from autoregressive models — via parallel computing. Speedup factors are around 1.2–33 under various conditions and computation models.
5
131
628
1
18
103
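A rough sketch of the fixed-point view, for the simplest case of deterministic (greedy) decoding; true sampling additionally requires fixing the per-position noise up front. step_fn is a placeholder for one parallel forward pass of a causal model, not an API from the paper.

```python
import numpy as np

def parallel_decode(step_fn, x_init, max_iters):
    # View decoding as a triangular system x[i] = f(x[:i]) and solve it with
    # Jacobi-style fixed-point iteration: update every position in parallel,
    # repeat until nothing changes. Converges in at most len(x) iterations
    # (which is just sequential decoding), often in far fewer.
    x = np.asarray(x_init).copy()
    for _ in range(max_iters):
        x_new = step_fn(x)          # f applied to every prefix in one pass
        if np.array_equal(x_new, x):
            break
        x = x_new
    return x
```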
@sedielem
Sander Dieleman
1 month
Text-to-music is having a moment👀 The team behind Udio are some of the brightest and most goal-driven people I've had the pleasure to work with, before they went on to found Uncharted Labs. Amazing to see the fruits of their labour out in the open!
@udiomusic
udio
1 month
Introducing Udio, an app for music creation and sharing that allows you to generate amazing music in your favorite styles with intuitive and powerful text-prompting. 1/11
861
1K
6K
0
8
101
@sedielem
Sander Dieleman
2 years
Diffusion models for language have mostly used discrete diffusion processes (e.g. D3PM, ARDM, SUNDAE, DiffusER, ...), but if you want to stick with continuous diffusion, you can simply embed token sequences first. This works pretty well as it turns out, even at scale!
@_akhaliq
AK
2 years
Self-conditioned Embedding Diffusion for Text Generation abs: propose SED, the first generally-capable continuous diffusion model for text generation
Tweet media one
1
43
188
4
19
102
@sedielem
Sander Dieleman
8 years
New ResNet results from He et al.: put ReLU/batchnorm before weight layers instead of after!
Tweet media one
0
59
98
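A sketch of the reordering in PyTorch: a minimal "pre-activation" block, not the paper's full architecture.

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    # Pre-activation residual block: batchnorm and ReLU come *before* each
    # conv instead of after, leaving the skip path as a clean identity.
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        return x + out
```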
@sedielem
Sander Dieleman
8 years
A human rendition of one of the #WaveNet piano samples, and some detailed analysis from Magenta:
Tweet media one
0
42
94
@sedielem
Sander Dieleman
6 years
Autoregressive models like PixelCNN don't necessarily have to be trained using maximum likelihood. Here's an interesting alternative from several of my colleagues!
@GoogleDeepMind
Google DeepMind
6 years
Autoregressive Quantile Networks for Generative Modeling:
1
76
257
1
22
94
@sedielem
Sander Dieleman
2 years
The point of doing this in a square root is slightly less obvious: the derivative of √x w.r.t. x is 1/(2√x). Don't ignore this one unless you like exploding gradients! My code always ends up thoroughly seasoned with εs🧂 Are there any other situations where adding ε is useful?
6
2
96
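The square-root case in numbers, evaluating the derivative stated in the tweet above (the ε value is illustrative):

```python
import numpy as np

def grad_sqrt(x, eps=0.0):
    # d/dx sqrt(x + eps) = 1 / (2 * sqrt(x + eps))
    return 1.0 / (2.0 * np.sqrt(x + eps))

print(grad_sqrt(1e-12))         # 500000.0: already enormous
print(grad_sqrt(0.0))           # inf: the exploding gradient
print(grad_sqrt(0.0, 1e-10))    # 50000.0: large but finite, bounded by 1/(2*sqrt(eps))
```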
@sedielem
Sander Dieleman
5 months
Diffusion circle at @NeurIPSConf : let's meet at 2:30pm on Thursday (tomorrow!) outside Hall E (Gate 10B) and then find a place to sit and have a chat. We'll do it old school and just sit on the floor somewhere. I'll share location updates live! Tell your friends! #NeurIPS2023 📢
4
12
94
@sedielem
Sander Dieleman
11 months
Interesting alternative derivation of diffusion models without differential equations or variational inference. Reminiscent of flow matching / rectified flow. Which of these perspectives is the simplest is subjective IMO, but more is better: new perspectives inspire new ideas!
@eric_heitz
Eric Heitz
11 months
When @_Laurent and I started learning about diffusion models, we were puzzled by the amount of jargon and concepts. So, we derived a model from scratch with our own graphics-people intuitions. Simple derivation, simple implementation, SOTA quality.
Tweet media one
33
355
2K
2
11
89
@sedielem
Sander Dieleman
10 months
📢 diffusion circle! 📢 As is becoming tradition, let's talk diffusion/iterative refinement at #ICML2023 ! Let's meet at registration 3:00pm on Thursday July 27 and find a spot to sit and chat. Please share and tag anyone at the conference who might be interested!
8
11
90
@sedielem
Sander Dieleman
8 years
Presenting our poster on cyclic symmetry in CNNs this afternoon at #ICML2016 ! (With @JeffreyDeFauw and @koraykv )
Tweet media one
1
39
91
@sedielem
Sander Dieleman
11 months
Neat idea: distill LLMs with reverse instead of forward KL, so the student overgeneralises less. A bit more involved, but it seems to pay off! Reminiscent of probability density distillation (), but this method works for categorical distributions.
@arankomatsuzaki
Aran Komatsuzaki
11 months
Knowledge Distillation of Large Language Models - Proposes MiniLLM that distills smaller language models from generative larger language models - Scalable for different model families with 120M to 13B parameters repo: abs:
Tweet media one
5
75
288
1
14
86
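A toy illustration of the forward/reverse distinction for categorical distributions; the probability vectors are made up for illustration and the loss here is not the paper's full objective.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for categorical distributions given as probability vectors.
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([0.70, 0.25, 0.04, 0.01])
student = np.array([0.60, 0.39, 0.005, 0.005])

# Forward KL(teacher || student) punishes the student for assigning little mass
# where the teacher has some, so the student tends to spread out (overgeneralise).
# Reverse KL(student || teacher) punishes the student for putting mass where the
# teacher has little, giving mode-seeking behaviour instead.
print(kl(teacher, student), kl(student, teacher))
```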
@sedielem
Sander Dieleman
5 months
Who's at @NeurIPSConf next week? I'll be on the panel for on Mon Dec 11, I'm co-organising on Sat Dec 16, and I'll be hanging out near the @GoogleDeepMind booth throughout the week. Keen to chat about generating stuff! #NeurIPS2023
10
4
88
@sedielem
Sander Dieleman
1 month
This blog post is an amazing exposition and analysis of consistency models, and how they relate to diffusion models, leading to several suggested improvements to the training procedure that look very promising. Definitely worth a read!
@ZhengyangGeng
Zhengyang Geng
1 month
🚀Our latest blog post unveils the power of Consistency Models and introduces Easy Consistency Tuning (ECT), a new way to fine-tune pretrained diffusion models to consistency models. SoTA fast generative models using 1/32 training cost! 🔽 Get ready to speed up your generative…
Tweet media one
7
47
143
1
12
88
@sedielem
Sander Dieleman
4 years
I believe I've only encountered self-similarity matrices in the context of music structure analysis until now. This is a really neat application of the idea: counting repetitions in videos.
@debidatta
Debidatta Dwibedi
4 years
Introducing RepNet, a model that counts repetitions in videos of *any* action w @yusufaytar , @JonathanTompson , @psermanet and Andrew Zisserman Paper: Project: Video: #CVPR2020 #computervision #deeplearning
16
125
485
4
16
85
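The core computation behind a self-similarity matrix is tiny; a sketch assuming a (time, feature) array of per-frame embeddings:

```python
import numpy as np

def self_similarity(features, eps=1e-10):
    # features: (T, D) per-frame embeddings. Normalise each frame, then take
    # all pairwise dot products; repeated segments show up as bright
    # off-diagonal stripes in the resulting (T, T) matrix.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    return f @ f.T
```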
@sedielem
Sander Dieleman
11 months
Making diffusion language models work as well as autoregressive ones will be a challenge (see my earlier blog post: ). This paper quantifies this and finds a 64x efficiency disadvantage across all scales 👀 a big gap, but at least it's a constant factor!
@__ishaan
Ishaan Gulrajani
11 months
New paper with @tatsu_hashimoto ! Likelihood-Based Diffusion Language Models: Likelihood-based training is a key ingredient of current LLMs. Despite this, diffusion LMs haven't shown any nontrivial likelihoods on standard LM benchmarks. We fix this!🧵
Tweet media one
8
37
254
4
18
87
@sedielem
Sander Dieleman
3 years
@DavidSKrueger Absolutely! Score-based / diffusion-based generative models are basically denoising autoencoders. Sure, they predict ε from x + ε instead of predicting x, but that's just a question of adding a residual connection🙂
2
3
87
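In the simple additive-noise setting the reply describes, the residual connection is literally one subtraction; scaled diffusion parameterisations also divide by the signal scale. model is a placeholder for a noise-predicting network.

```python
def denoise(model, x_noisy):
    # An eps-predicting network turned into an x-predicting one:
    # the clean estimate is the noisy input minus the predicted noise,
    # i.e. a residual connection around the network.
    eps_hat = model(x_noisy)
    return x_noisy - eps_hat
```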
@sedielem
Sander Dieleman
2 years
@Thom_Wolf Classifier-free guidance is a cheatcode that makes these models perform as if they had 10x the parameters. At least in terms of sample quality, and at the cost of diversity. All of the recent spectacular results rely heavily on this trick.
1
2
85
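A sketch of the guidance rule being referred to, assuming a noise-predicting model; model and its signature are placeholders.

```python
def guided_prediction(model, x_t, t, cond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # towards the conditional one. guidance_scale = 1 recovers the conditional
    # model; larger values trade diversity for sample quality.
    eps_uncond = model(x_t, t, cond=None)
    eps_cond = model(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```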
@sedielem
Sander Dieleman
2 years
This is neat: state-space models (S4-style) for raw audio! WaveNet's dilated convolutions are an elegant architectural prior for waveforms, which I'm still very fond of, but this clearly wins in terms of param efficiency. Includes a bidirectional extension to diffusion models🥳
@arankomatsuzaki
Aran Komatsuzaki
2 years
It's Raw! Audio Generation with State-Space Models Achieves SotA perf on autoregressive unconditional waveform generation. proj: repo: abs:
Tweet media one
1
23
161
3
8
85
@sedielem
Sander Dieleman
1 year
Who's coming to #NeurIPS2022 ? ICML was great, but attendance was, understandably, a tad sparse... I'm looking forward to (re)connecting with more people this time around! Keen to talk about generative models, iterative refinement, diffusion, that sort of thing🤓
8
1
84
@sedielem
Sander Dieleman
4 years
On Thursday Aug 20, I'm speaking about generating music in the waveform domain at the Vienna deep learning meetup! Virtually of course, from my desk in London 🙂 Sign up to attend: I'll cover most of plus some recent developments!
2
20
82
@sedielem
Sander Dieleman
1 year
Rumours of GANs' demise have been greatly exaggerated, part 2
@_akhaliq
AK
1 year
Scaling up GANs for Text-to-Image Synthesis present our 1B-parameter GigaGAN, achieving lower FID than Stable Diffusion v1.5, DALL·E 2, and Parti-750M. It generates 512px outputs at 0.13s, orders of magnitude faster than diffusion and autoregressive …
40
294
1K
3
5
83
@sedielem
Sander Dieleman
3 years
This work shows how you can sample from autoregressive models with Langevin dynamics, the same iterative refinement approach that powers sampling from score- & diffusion-based models! With this, you can use AR models like WaveNet for denoising, inpainting and source separation.
@vivjay30
Vivek Jayaram
3 years
Excited to share our new paper to appear at @icmlconf ! We show a new way to sample from an autoregressive model like Wavenet. Using Langevin sampling, we can solve many tasks like super-resolution, inpainting, or separation with the same network. Website:
3
35
162
0
7
81
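For reference, the generic (unadjusted) Langevin update the tweet alludes to, sketched with a placeholder score function; the paper applies this idea to autoregressive models rather than score networks.

```python
import numpy as np

def langevin_sample(score_fn, x, step_size, num_steps, rng):
    # Unadjusted Langevin dynamics: repeatedly nudge x along the gradient of
    # log p(x) (the "score"), plus Gaussian noise to keep exploring.
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * noise
    return x
```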
@sedielem
Sander Dieleman
4 years
We've updated the EATS paper on arXiv: 'End-to-end' has many possible interpretations – Table 5 in the appendix (p. 21) describes some of the many ways in which the TTS pipeline has been factorised into stages in the literature, for easier comparison.
Tweet media one
@GoogleDeepMind
Google DeepMind
4 years
In our new paper [] we propose EATS: End-to-End Adversarial Text-to-Speech, which allows for speech synthesis directly from text or phonemes without the need for multi-stage training pipelines or additional supervision. Audio:
Tweet media one
8
200
706
2
25
80
@sedielem
Sander Dieleman
1 year
This is definitely a problem with AR waveform models, which produce very long sequences (~10^6 steps) and are prone to "going off the rails". It's clearly not been much of an issue with language models so far, but I suppose it could be in the long run! Diffusion it is, then?😁
@ylecun
Yann LeCun
1 year
I have claimed that Auto-Regressive LLMs are exponentially diverging diffusion processes. Here is the argument: Let e be the probability that any generated token exits the tree of "correct" answers. Then the probability that an answer of length n is correct is (1-e)^n 1/
Tweet media one
220
541
3K
9
7
78
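Plugging numbers into the quoted (1-e)^n argument makes the sequence-length point concrete; e here is an illustrative per-step derailment probability, not a measured one.

```python
e = 1e-4                       # illustrative per-step probability of derailing
for n in (10**2, 10**4, 10**6):
    print(n, (1 - e) ** n)
# 100       -> ~0.99   (short LM outputs: barely affected)
# 10_000    -> ~0.37
# 1_000_000 -> ~4e-44  (waveform-length sequences: essentially guaranteed to derail)
```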
@sedielem
Sander Dieleman
2 years
Audio folk! We are hosting a workshop at ICML this year, and we're very keen to hear what you've been working on. Especially if it generates speech🗣️, music🎶, bird song🐦, rain sounds🌧️, traffic noise🚘, or anything in between! Submissions are due by May 11 (up to 4 pages).
@KulisBrian
Brian Kulis
2 years
Announcing the Workshop on Machine Learning for Audio Synthesis at #ICML2022 @icmlconf ! Paper submissions on all aspects of audio generation/synthesis using ML welcome. Webpage: Organizers: @sedielem , Yu Zhang, @rmanzelli , @saddlepoint18 , @KulisBrian
1
26
73
0
16
79
@sedielem
Sander Dieleman
3 years
I've previously discussed the importance of measuring likelihoods in the right space in a blog post () and on Twitter (e.g. ). (5/8)
@sedielem
Sander Dieleman
3 years
Measuring likelihoods in the right representation space is important, and we need prior knowledge to find that space. This work formalises this argument for anomaly detection, and also demonstrates that looking at typicality instead of density isn't enough for reliable detection.
0
13
53
1
8
78