Konstantin Mishchenko

@konstmish

4,324 Followers · 563 Following · 129 Media · 511 Statuses

Research Scientist at Samsung AI Center · Outstanding Paper Award @icmlconf 2023 · Action Editor @TmlrOrg · I tweet about ML papers and math

London, UK
Joined June 2020
Pinned Tweet
@konstmish
Konstantin Mishchenko
4 months
Is optimization still a good topic to work on for a PhD? Here is my detailed response to a student who reached out asking this question.
11
84
776
@konstmish
Konstantin Mishchenko
3 years
The source of information most useful to me during my PhD was neither a book nor a lecture series nor a paper. It was this simple cheatsheet by @fpedregosa : His other blog posts are great too!
7
130
838
@konstmish
Konstantin Mishchenko
2 years
Let me share a result that surprised even Yurii Nesterov. You know how Taylor series become more accurate as you use more terms? Well, it turns out that in convex optimization, 3rd derivatives might be completely useless Let me explain. Thread 1/5
12
159
814
@konstmish
Konstantin Mishchenko
1 year
Ever wanted to understand the convergence theory of SGD? Probably best to start with this new paper () that collects all simple proofs in one place: momentum, minibatching, prox, averaging, and more. Covers nonconvex and nonsmooth cases too.
7
117
613
@konstmish
Konstantin Mishchenko
9 months
Michael I. Jordan on ChatGPT and AI: "We're fearful in 10 years it can get out of control and all that. That's just not true. ChatGPT is predicting the next word in a sentence, just remember that's what it does". 1/3
36
70
444
@konstmish
Konstantin Mishchenko
4 months
I'd like to start 2024 with some thoughts on the state of optimization theory. To begin with, let me quote Nesterov's 2003 book: "The main fact, which should be known to any person dealing with optimization models, is that, in general, the optimization problems are unsolvable" 🧵
3
75
435
@konstmish
Konstantin Mishchenko
6 months
Why do we need warm-up, cosine annealing, and other learning rate schedules when training with gradient descent? It turns out it's all about how gradient norms change over time. E.g., large norms at the start => warm-up. Slow decrease => cosine. Paper: 0/4
3
72
420
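As a concrete illustration of the schedules discussed above, here is a minimal warm-up-plus-cosine sketch; the warm-up length and peak value are arbitrary placeholders, not values from the paper.

```python
import math

def lr_schedule(step, total_steps, warmup_steps=1000, peak_lr=3e-4, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine annealing down to min_lr.
    Assumes 0 <= step <= total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```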
@konstmish
Konstantin Mishchenko
3 years
Fun fact 1 (standard): computing the matrix inverse A^(-1) has the same complexity as multiplying two matrices AB ().
Fun fact 2 (novel): solving a sparse linear system Ax=b can be cheaper than computing a matrix product. This one is insane ()
3
79
355
@konstmish
Konstantin Mishchenko
1 year
How to use a cheap proxy loss function (e.g. simulator or synthetic data) if you care about performance in the real world or on true data? We present a method that can provably minimize the target objective while mostly using the cheaper function! pdf: 1/4
6
49
349
@konstmish
Konstantin Mishchenko
10 months
This is the most mind-blowing result that I've seen in a while. Just when I thought I knew everything about gradient descent, this new theory showed me I didn't. Gradient descent needs to use unreasonably large stepsizes every now and then to converge faster. This is wild.
@prof_grimmer
Ben Grimmer
10 months
I've proven the strangest result of my career.. The classic idea that gradient descent's rate is best with constant stepsizes 1/L is wrong. The idea that we need stepsizes in (0,2/L) for convergence is wrong. Periodic long steps are better, provably.
73
609
4K
8
37
343
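A toy way to poke at this claim on a quadratic, assuming nothing beyond numpy; the occasional-long-step pattern below is an ad-hoc illustration, not the carefully constructed stepsize sequence from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((50, 50))
A = G @ G.T / 50                      # PSD matrix for the quadratic f(x) = 0.5 * x^T A x
L = np.linalg.eigvalsh(A).max()       # Lipschitz constant of the gradient A x

def run(stepsize, n_iter=500):
    x = np.ones(50)
    for k in range(n_iter):
        x = x - stepsize(k) * (A @ x)
    return 0.5 * x @ A @ x

f_const = run(lambda k: 1.0 / L)                                  # classic 1/L steps
f_long  = run(lambda k: 4.0 / L if k % 10 == 9 else 1.0 / L)      # occasional long step
print(f_const, f_long)                                            # compare final values
```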
@konstmish
Konstantin Mishchenko
3 years
Mirror descent runs gradient descent in the dual space by mapping our vector x into it. But what if instead, we map the gradient and run GD in primal space? This gives us "dual preconditioning" that generalizes various gradient transformations. @cjmaddison
4
58
338
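For reference, the two updates written side by side, with h the mirror map and h* its convex conjugate; this is my reading of the dual-preconditioning idea above, so treat the exact form as an assumption.

```latex
\text{Mirror descent:}\qquad x_{k+1} = \nabla h^{*}\!\bigl(\nabla h(x_k) - \gamma \nabla f(x_k)\bigr)
\text{Dual preconditioning:}\qquad x_{k+1} = x_k - \gamma\, \nabla h^{*}\!\bigl(\nabla f(x_k)\bigr)
```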
@konstmish
Konstantin Mishchenko
2 years
Five years ago, I started my first optimization project, which was about asynchronous gradient descent. Today, I'm happy to present our new work (with @BachFrancis , M. Even and B. Woodworth) where we finally prove: Delays do not matter. 🧵1/5
2
45
331
@konstmish
Konstantin Mishchenko
5 months
I wish every poster at every conference looked like this. The idea that all text should be the same size is just wrong.
2
26
315
@konstmish
Konstantin Mishchenko
2 years
Newton’s method is a great heuristic but it doesn’t work globally (and line search doesn’t fix it). Cubic Newton works globally but needs an expensive subroutine. Can we get something in between? I’m proud to present a method that does exactly that: [1/3]
10
46
279
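If I recall the construction correctly, the "in between" method is Newton's step with a gradient-dependent regularizer, where H is a Hessian-Lipschitz-type constant; treat this as a sketch of the idea rather than the paper's exact statement.

```latex
x_{k+1} = x_k - \Bigl(\nabla^2 f(x_k) + \sqrt{H\,\lVert \nabla f(x_k)\rVert}\; I\Bigr)^{-1} \nabla f(x_k)
```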
@konstmish
Konstantin Mishchenko
11 months
Take your favorite optimizer (Adam, SGD, Lion) and feed it into Mechanic to get the learning rate. You give it the direction, it gives you the magnitude. The new optimizer has been tested on a wide range of deep learning problems: ViT, LSTM, ResNet, etc.
6
49
264
@konstmish
Konstantin Mishchenko
3 years
Relative smoothness is one of the most reinvented definitions in optimization. It gives a simple condition for convergence of mirror descent and dates back at least to 2010. Its many applications should be the way mirror descent is motivated in lectures.
3
46
258
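For completeness, the definition in question: f is L-smooth relative to a reference function h if the upper bound below holds, with D_h the Bregman divergence of h.

```latex
f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + L\, D_h(y, x),
\qquad D_h(y, x) = h(y) - h(x) - \langle \nabla h(x),\, y - x\rangle .
```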
@konstmish
Konstantin Mishchenko
11 months
Heavy-ball momentum is different from Nesterov's: its main benefit is 𝘯𝘰𝘵 acceleration, but the fact that the 𝘭𝘢𝘴𝘵 iterate, rather than average, converges under stochastic gradients. It's best explained in this paper that almost no one knows about:
6
44
232
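To make the contrast concrete, the two momentum updates in their standard textbook forms:

```latex
\text{Heavy ball:}\quad x_{k+1} = x_k - \gamma \nabla f(x_k) + \beta\,(x_k - x_{k-1})
\text{Nesterov:}\quad y_k = x_k + \beta\,(x_k - x_{k-1}), \qquad x_{k+1} = y_k - \gamma \nabla f(y_k)
```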
@konstmish
Konstantin Mishchenko
1 year
Do you like when an optimizer works with stepsize=1? @aaron_defazio and I just released a new stepsize estimation trick called D-Adaptation that does that for Adagrad and Adam. Paper pdf: Code: 🧵Thread 1/7
6
55
238
@konstmish
Konstantin Mishchenko
11 months
We just released a new optimizer Prodigy. Based on D-Adaptation, Prodigy has stronger theoretical guarantees and we derive how to estimate the stepsize for Adam. We tested it on a wide range of deep nets, so I can confidently say: it does work in practice.
3
54
211
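A minimal usage sketch, assuming the package is installed as `prodigyopt` and exposes a `Prodigy` optimizer class as in the repository's README; check the repo for the actual interface and recommended settings.

```python
import torch
from prodigyopt import Prodigy  # assumed import path; see the Prodigy repository

model = torch.nn.Linear(10, 1)
# lr acts as a multiplier on the internally estimated stepsize, so lr=1.0 is the usual choice.
opt = Prodigy(model.parameters(), lr=1.0)

for _ in range(100):
    loss = model(torch.randn(32, 10)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```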
@konstmish
Konstantin Mishchenko
11 months
A simple yet elegant observation: in SGD, if we apply the sign to the stochastic gradient, it breaks convergence: the signSGD method might oscillate forever. This new paper proposes to inject noise before applying the sign and proves that this fixes the method:
4
28
209
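The fix described above, written out schematically (g_k is the stochastic gradient and ξ_k the injected noise); this is a paraphrase of the update, not a quote from the paper.

```latex
x_{k+1} = x_k - \gamma_k\, \mathrm{sign}\bigl(g_k + \xi_k\bigr)
```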
@konstmish
Konstantin Mishchenko
1 year
An announcement I've been meaning to make for some time, but I wanted to wait until I have a visa. In January 2023, I will be joining Samsung AI center in Cambridge, UK as a Research Scientist! I will be part of Distributed AI group and will keep working on efficient ML and FL.
9
4
211
@konstmish
Konstantin Mishchenko
3 years
Almost no one knows about this NIPS 1993 paper by O. Mangasarian and his PhD student M. Solodov. Yet, today their algorithm is known as Local-SGD and it's a huge success in parallel/federated learning. It'd be a golden oldie if it wasn't ahead of its time.
0
35
202
@konstmish
Konstantin Mishchenko
11 months
SGD in practice usually doesn't sample data uniformly and instead goes over the dataset in epochs, which is called Random Reshuffling. We've known for some time that RR is better than SGD for convex functions and now it's been proven for nonconvex:
5
29
197
@konstmish
Konstantin Mishchenko
3 years
What happens to Adam if we turn off momentum and epsilon? If we set β₁=0, we get RMSprop. If we set β₁=β₂=0, we get signSGD. How well does signSGD with constant batch size converge? It doesn't. Not even with tiny stepsizes and overparameterization.
4
29
188
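For reference, the reductions spelled out from the standard Adam update (bias correction omitted for brevity):

```latex
m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k, \qquad
v_k = \beta_2 v_{k-1} + (1-\beta_2)\, g_k^2, \qquad
x_{k+1} = x_k - \gamma\, \frac{m_k}{\sqrt{v_k} + \varepsilon}.
% beta_1 = 0:           x_{k+1} = x_k - gamma * g_k / (sqrt(v_k) + eps)   -> RMSprop
% beta_1 = beta_2 = 0:  x_{k+1} = x_k - gamma * g_k / (|g_k| + eps)       -> signSGD as eps -> 0
```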
@konstmish
Konstantin Mishchenko
3 years
Yurii Nesterov started one of his keynote talks by imagining that someone in the audience would say "Oh my God! Again new optimization methods. We hear about that for a decade, why do you need to generate so many schemes and all things like that." Actually, a good question...
4
19
185
@konstmish
Konstantin Mishchenko
10 months
I keep forgetting that backprop doesn't give us (sub)gradients. It can even return a nonzero value on a function that is 0 everywhere, so one needs to be careful with the assumptions on what backprop can do. A good read on the topic:
7
28
168
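A concrete instance of the phenomenon (a classic example from the autodiff literature, here in PyTorch): the function below is identically zero, yet backprop returns a nonzero "gradient" at 0 because it composes the subgradient choices of the individual ReLUs.

```python
import torch

def zero_fn(x):
    # relu(x) - relu(-x) equals x for every real x, so this function is identically 0.
    return torch.relu(x) - torch.relu(-x) - x

x = torch.tensor(0.0, requires_grad=True)
zero_fn(x).backward()
print(x.grad)  # tensor(-1.), not 0: autodiff output need not be a (sub)gradient
```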
@konstmish
Konstantin Mishchenko
4 years
#ICML2020 Our work on Adaptive Gradient Descent to be presented in 30 mins (and one more time in 13 hours). A parameter-free replacement of line search. Provably estimates local curvature. Works out-of-the-box on almost any convex problem. Please retweet!
3
41
165
@konstmish
Konstantin Mishchenko
4 years
Online conferences suck and have almost zero interaction. AISTATS this year had videos with chat sessions, I got 1 question in total from the sessions. But at least papers that won the best paper award should have a lot of interaction? See the picture for the answer.
7
19
157
@konstmish
Konstantin Mishchenko
3 years
"Gradient descent with exact line search may not converge" and other recent (IMO, counterintuitive) counterexamples in convex optimization. My second favorite is "Newton’s flow may not converge" arxiv:
0
30
154
@konstmish
Konstantin Mishchenko
3 months
I often hear people who know optimization theory say "averaging works assuming convexity, so it doesn't make sense for deep learning." Yes it does, it just works. Often, it can replace learning rate scheduler. Just don't use uniform weights, use EMA or polynomial weights.
@rasbt
Sebastian Raschka
3 months
Weight averaging and model merging for LLMs seem to be the most interesting themes in 2024 so far. What are the benefits? Combining multiple models (or checkpoints) into a single one can improve training convergence, overall performance, and also robustness. I will probably do…
21
165
1K
7
14
152
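A minimal sketch of the EMA variant in PyTorch (the decay value is a common placeholder, and in practice buffers such as batch-norm statistics need the same treatment); train the raw model and evaluate with the averaged copy.

```python
import copy
import torch

def make_ema(model):
    ema = copy.deepcopy(model)
    for p in ema.parameters():
        p.requires_grad_(False)
    return ema

@torch.no_grad()
def update_ema(ema, model, decay=0.999):
    # ema <- decay * ema + (1 - decay) * current weights
    for p_ema, p in zip(ema.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)
```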
@konstmish
Konstantin Mishchenko
11 months
The gradient gives you direction to a solution, but it doesn't tell you how far you should go. To remove parameter tuning from SGD, we need to estimate the distance. Today we present DoWG, a method that can do that provably for a wide class of functions:
2
26
151
@konstmish
Konstantin Mishchenko
3 years
SGD in theory: sample uniformly from all data.
SGD in practice (aka Random Reshuffling): go over the data in random order.
If you have `for data in trainloader` in your code, you're using it. If you want to know why it's faster, check our #neurips paper!
3
17
145
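The two sampling schemes spelled out at the index level, as a toy sketch; a dataloader with shuffle=True implements the second one.

```python
import random

n, n_epochs = 1000, 10
dataset = list(range(n))

# SGD as usually analyzed: sample an index uniformly at random at every step.
for _ in range(n * n_epochs):
    i = random.randrange(n)
    # ... gradient step on dataset[i]

# Random Reshuffling: shuffle once per epoch and sweep through every example.
for _ in range(n_epochs):
    random.shuffle(dataset)
    for example in dataset:   # this is what `for data in trainloader` does
        pass                  # ... gradient step on `example`
```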
@konstmish
Konstantin Mishchenko
3 years
I recall the following problem that I was asked by a hedge fund during my interview. For any n angles a_1, ..., a_n from [0, 2pi], prove that the matrix A whose entries are A_{ij}=cos(a_i-a_j) is positive semi-definite. Don't know how it helps with trading but the problem is nice
@gabrielpeyre
Gabriel Peyré
3 years
Positive semi-definite matrices come in many flavors!
3
108
638
5
18
138
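For the record, the short solution: the angle-difference formula turns A into a sum of two rank-one matrices, which is enough to conclude.

```latex
A_{ij} = \cos(a_i - a_j) = \cos a_i \cos a_j + \sin a_i \sin a_j
\;\Longrightarrow\;
A = u u^{\top} + v v^{\top}, \qquad u_i = \cos a_i,\; v_i = \sin a_i,
\qquad z^{\top} A z = (u^{\top} z)^2 + (v^{\top} z)^2 \ge 0 .
```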
@konstmish
Konstantin Mishchenko
2 years
Interesting new paper on a subclass of convex functions that are smooth just at the optima. A big surprise is that Heavy-Ball Momentum turns out to be optimal despite being suboptimal in a smaller class of convex functions that are smooth everywhere.
0
29
134
@konstmish
Konstantin Mishchenko
9 months
This paper on training RNNs that proposed using clipping to tackle exploding gradients is one of those papers that don't get old. I really enjoyed reading it, especially since clipping has become a vital component of training transformers and other models.
1
20
134
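For reference, the standard way clipping is applied in a PyTorch training step (the max_norm value is just a common default, not one from the paper):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(32, 10)).pow(2).mean()
opt.zero_grad()
loss.backward()
# Rescale the full gradient so its global norm is at most 1.0 (tames exploding gradients).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```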
@konstmish
Konstantin Mishchenko
3 years
Very interesting news, @SchmidhuberAI is joining KAUST as a Director of the AI Initiative. I think this is a great match, he has a lot of experience in the area and seems to have big ambitions, while KAUST provides a great environment for building and growing research labs.
2
11
134
@konstmish
Konstantin Mishchenko
3 years
In minmax optimization, there is no direct analog of Nesterov's acceleration. However, this year, some form of acceleration was proposed to make gradient norms small faster. It is a bit similar to the extragradient but uses a very unusual momentum. arxiv:
1
17
131
@konstmish
Konstantin Mishchenko
2 years
This made my day, I really hope that TMLR will become more popular than NeurIPS, ICML, and other ML conferences:
- no deadline => higher submission quality
- reviews as quick as at conferences
- no need to review 7 papers at once
@hugo_larochelle
Hugo Larochelle
2 years
Today, @RaiaHadsell , @kchonyc and I are happy to announce the creation of a new journal: Transaction on Machine Learning Research (TMLR) Learn more in our post:
54
693
3K
3
7
125
@konstmish
Konstantin Mishchenko
9 months
"Almost every piece of journalism on AI I see is not very good, that's the shocking thing. They are all going for this danger that it's going to surpass humans. A computer already surpasses me in ability to calculate integrals, it surpasses me in tons of ways, I don't care." 3/3
3
7
114
@konstmish
Konstantin Mishchenko
10 months
I was wondering what Nesterov works on these days, seems like he's currently interested in minimizing 4th-order polynomials. He also uses these methods as subsolvers for nonconvex problems with 4th-order components, such as phase retrieval. pdf:
2
14
115
@konstmish
Konstantin Mishchenko
3 years
The difficulty of analyzing quasi-Newton methods comes from the fact that we try to "learn" the Hessian. Rodomanov and Nesterov made a breakthrough last year quantifying this and showing rates of convergence. This new paper proposes an even simpler theory!
3
15
109
@konstmish
Konstantin Mishchenko
9 months
Deep learning: is it more about non-convexity or non-smoothness? I firmly believe it's non-smoothness. Adagrad and other tricks for non-smooth problems had a lot of impact. In contrast, methods for smooth non-convex problems, especially variance reduction, do not seem as useful.
7
2
104
@konstmish
Konstantin Mishchenko
11 months
Gradient descent used to be studied under "quadratic-like" assumptions, with quadratic upper and lower bounds. It's nice to see more and more papers break this tradition and show that GD and other invented tricks do not need these assumptions, like this paper:
1
15
99
@konstmish
Konstantin Mishchenko
3 years
Everyone knows that the dual norm to l_inf norm is l_1 norm. What happens if we define k-norm as the sum of the k largest components (k=1 corresponds to l_inf)? It turns out the answer is very simple and combines l_1 and l_inf in an interesting way:
4
9
98
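If I remember the result correctly (treat it as an assumption), the dual of the k-norm is a maximum of the two familiar norms; the endpoints k=1 and k=n recover the l_1 and l_inf duals.

```latex
\lVert x\rVert_{(k)} = \sum_{i=1}^{k} \bigl|x_{[i]}\bigr| \;\;(\text{the $k$ largest magnitudes})
\qquad\Longrightarrow\qquad
\lVert y\rVert_{(k)}^{*} = \max\Bigl\{\, \lVert y\rVert_{\infty},\; \tfrac{1}{k}\lVert y\rVert_{1} \Bigr\}.
```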
@konstmish
Konstantin Mishchenko
7 months
Constrained optimization perspective on what Lion optimizer is doing. They also generalize Lion to operations other than sign in the update. Paper: It seems highly related to dual space preconditioning, which is somehow not cited:
0
16
95
@konstmish
Konstantin Mishchenko
2 years
Surprise, reviews are random!
- 25/44 papers chosen for spotlight were rejected by the second group.
- But at least if all reviewers vote "7", it's a solid "7"? Nope, still random, it could be "4, 5, 4"!
So is a NeurIPS paper more credible than an arXiv submission? I'm not so sure.
@rasbt
Sebastian Raschka
2 years
"NeurIPS 2021 finally accepted submissions statistics" from (summarized via ). Wow, it looks like 199 out of the 298 papers (that is, 2/3!!!) accepted by group 1 would have been rejected by group 2.
9
97
396
2
21
91
@konstmish
Konstantin Mishchenko
2 years
If the credit is shared, so should be responsibility for plagiarism. A senior author can't walk away saying "I didn't actively participate" because then they shouldn't have been an author in the first place. There even was arxiv comment about plagiarism, all authors could see it.
@giffmana
Lucas Beyer (bl16)
2 years
Roadmap to big plagiarism🤦‍♂️ Let me guess the reaction this will get "we identified that ALL the instances were written by ONE of the authors, and we have taken action against him/her." And that scapegoat will be the most junior of them all. I hope I'm wrong though!
17
7
153
4
6
89
@konstmish
Konstantin Mishchenko
1 month
Finished responding to all ICML rebuttals for papers where I am a reviewer, increased the score for one of the papers and explained why I don't agree with the responses for the other papers. Please do that if you're an ICML reviewer too, it doesn't take that much time.
4
4
89
@konstmish
Konstantin Mishchenko
10 months
TensorFlow and Flax set the default momentum in batchnorm to 0.99, while PyTorch uses 0.9. This parameter (β) is used to estimate the running mean as E[x] = β*E[x] + (1-β)*mean(batch), so it appears very important. Did anyone study the optimal choice? I can't find anything.
6
11
88
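One subtlety worth adding: the two frameworks parameterize the same update with opposite conventions, so their defaults are closer than the raw numbers suggest. A small sketch (β refers to the coefficient on the running estimate, as in the formula above):

```python
import torch.nn as nn

# PyTorch's `momentum` multiplies the *new* batch statistic:
#   running = (1 - momentum) * running + momentum * batch_mean
# so its default momentum=0.1 corresponds to beta = 0.9 in the formula above.
bn_torch = nn.BatchNorm2d(64, momentum=0.1)

# TF/Keras/Flax use `momentum` as the coefficient on the *running* estimate:
#   running = momentum * running + (1 - momentum) * batch_mean
# so their default momentum=0.99 corresponds to beta = 0.99.
```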
@konstmish
Konstantin Mishchenko
11 months
Overparameterization empowers deep learning, but in federated learning the data is heterogeneous across clients, which makes it very difficult. We just proved that the solution to this is 𝑜𝑣𝑒𝑟𝑝𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝑖𝑧𝑎𝑡𝑖𝑜𝑛, and it's always achievable:
0
22
85
@konstmish
Konstantin Mishchenko
2 years
I was lucky to benefit from the former friendship between Russia and Ukraine. I have many Ukrainian friends. Back in 2011, I spent 1 month studying math at a summer camp near Kyiv. I'm Russian and I'm against the terrible war started by Russia. It is so cruel and so unnecessary.
1
7
85
@konstmish
Konstantin Mishchenko
2 years
@karpathy I want to point out that mathematically it matters if you're going over the dataset in random order (epochs) or if you sample random points from the whole dataset (iterations). The former is Random Reshuffling, the latter is SGD; they have different convergence rates.
3
4
83
@konstmish
Konstantin Mishchenko
3 years
Implicit meta-learning (iMAML) uses implicit gradient updates that remove the need to differentiate through an optimization loop. The objective it minimizes is the sum of Moreau envelopes, so iMAML converges to initialization close to task-loss minimizers.
2
19
82
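For context, the Moreau envelope and the resulting meta-objective (schematic notation, following the interpretation above): the learned initialization stays close, in the proximal sense, to each task's minimizer.

```latex
M_{\lambda} f_i(x) = \min_{z}\; f_i(z) + \tfrac{1}{2\lambda}\,\lVert z - x\rVert^{2},
\qquad
\min_{x}\; \frac{1}{n}\sum_{i=1}^{n} M_{\lambda} f_i(x).
```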
@konstmish
Konstantin Mishchenko
3 months
Nothing makes me so happy as seeing some of my work be used by people. Today I noticed D-Adaptation PyTorch implementation has been installed >1M times. Prodigy is getting close to 0.4M. If you want to try it: D-Adaptation: Prodigy:
1
9
83
@konstmish
Konstantin Mishchenko
2 years
I'm thrilled to share that I've been selected as a Rising Star in Data Science by the University of Chicago for my work on optimization and federated learning. It is very encouraging and I'm honored to be part of a great cohort! Official website:
6
2
83
@konstmish
Konstantin Mishchenko
3 years
The ratings are not final, but I decided I'd mention 10 optimization papers with top scores anyway 💪 Thread! 0/10
@SergeyI49013776
Sergey Ivanov
3 years
ICLR 2021 reviews are public and here is a sorted list for all papers with ratings. What's interesting is that none of the papers scored only 10s. The top-1 is a "graph" paper with 8.25. #iclr #iclr21 #iclr2021
0
72
313
1
13
82
@konstmish
Konstantin Mishchenko
2 months
Had the pleasure of presenting my work on adaptive optimizers at the CISS conference () at Princeton. It was a great event with talks by @KairouzPeter , @srush_nlp , @jasondeanlee and others, I’m very grateful to @chijinML for inviting me!
1
7
81
@konstmish
Konstantin Mishchenko
4 years
One of the most entertaining linear algebra books: "33 miniatures on linear algebra" by Matoušek. I enjoyed all of them but #12 even shocked me: one can prove the impossibility of tiling a rectangle with irrational side ratio by a finite # of squares using the Hahn–Banach theorem for R over Q.
1
17
80
@konstmish
Konstantin Mishchenko
1 year
Our paper got accepted as an oral for ICML but my US visa application got rejected, so I'd like to present it where I can. Are there any ML research seminars in London or around where I can give a talk? I can also cover more recent (and very exciting) parameter-free methods!
@konstmish
Konstantin Mishchenko
1 year
Do you like when an optimizer works with stepsize=1? @aaron_defazio and I just released a new stepsize estimation trick called D-Adaptation that does that for Adagrad and Adam. Paper pdf: Code: 🧵Thread 1/7
6
55
238
3
10
79
@konstmish
Konstantin Mishchenko
4 years
Few people know but the original paper of Nesterov written in 1984 is just 5 pages and very easy to follow. Convergence statement and the proof fit 1 page. Some call it "mysterious algebraic manipulations", but I disagree–it's simply a Lyapunov analysis.
@gabrielpeyre
Gabriel Peyré
4 years
Oldies but goldies: Yurii Nesterov, A method for solving a convex programming problem with convergence rate O(1/k^2). Improves the 1/k rate of vanilla gradient descent to 1/k^2. The idea had a profound impact on mathematical programming.
2
81
330
4
7
78
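For reference, the method in one of its commonly used equivalent forms (not a quote from the original paper):

```latex
y_k = x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1}),
\qquad
x_{k+1} = y_k - \frac{1}{L}\,\nabla f(y_k),
\qquad
f(x_k) - f^{\star} = O\!\left(\frac{1}{k^2}\right).
```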
@konstmish
Konstantin Mishchenko
10 months
The book "The Man Who Solved the Market" about Jim Simons is a must-read for people who love math. My favorite quote, attributed to an accomplished code-breaker Lenny Baum, is “Bad ideas is good, good ideas is terrific, no ideas is terrible.” A great motto for doing research.
1
9
77
@konstmish
Konstantin Mishchenko
1 month
I often mention in my talks that we should worry less about nonconvexity and more about non-smoothness. We can get surprisingly far without worrying about all those saddle points, but ignoring non-smoothness quickly leads to methods that don't work in practice.
@aaron_defazio
Aaron Defazio
1 month
I find the non-smooth convex analysis framework seems to better describe the behavior of neural network training than any nonconvex analysis framework.
6
5
99
5
5
69
@konstmish
Konstantin Mishchenko
4 months
I've never felt so GPU-poor as right now.
1
4
62
@konstmish
Konstantin Mishchenko
2 years
This week we presented our ICLR 2022 Spotlight work "IntSGD: Adaptive Floatless Compression of Stochastic Gradients". Key idea: adaptively compress float32->int8 on multiple GPUs paper: slides: poster:
1
16
63
@konstmish
Konstantin Mishchenko
3 years
Let me try to mimic ICML reviewers for these presentations. A thread with 3 reviews. 1. The drawing has some novelty but the overall quality is disappointing. Why the rocket doesn't have wings? All birds have it. Also, without experiments, the drawing is not too interesting.
@peter_richtarik
Peter Richtarik
3 years
#ICML2021 reviews: (thanks to @mher_safaryan for the link)
0
4
33
1
10
60
@konstmish
Konstantin Mishchenko
2 years
Happy to have 2 papers accepted at #ICML2022 on communication-efficient SGD: 1. Multiple gradient steps before aggregation give provable acceleration: 2. Random Reshuffling (SGD without resampling) with full passes over local data:
9
8
58
@konstmish
Konstantin Mishchenko
9 months
He strongly advocates for a balanced view. Not too exuberant, not too fearful. "Innovation is altogether inevitable and positive." but also "If you were an exuberant AI bro, which there are many down in Silicon Valley, I would have told you to chill and shut up." 2/3
2
4
55
@konstmish
Konstantin Mishchenko
2 years
The recording of my talk about globally convergent Newton method is available on youtube: I was lucky to have many experts on second-order methods in the audience who commented on the results, I hope it’d make the video more fun!
@konstmish
Konstantin Mishchenko
2 years
Newton’s method is a great heuristic but it doesn’t work globally (and line search doesn’t fix it). Cubic Newton works globally but needs an expensive subroutine. Can we get something in between? I’m proud to present a method that does exactly that: [1/3]
10
46
279
3
11
55
@konstmish
Konstantin Mishchenko
4 years
Dykstra's algorithm works for any number of sets n. If the sets are linear, you recover Kaczmarz algorithm. If you replace projection onto j-th set with prox of some function g_j, you get SDCA. If you include a gradient descent step, you get SDM:
@gabrielpeyre
Gabriel Peyré
4 years
Oldies but goldies: J. P. Boyle and R. L. Dykstra, A method for finding projections onto the intersection of convex sets in Hilbert spaces, 1986. Dykstra’s algorithm is the simplest method to project on the intersection of two convex sets.
2
49
192
0
8
56
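A minimal sketch of Dykstra's algorithm for two convex sets, given their projection operators (the box and ball here are arbitrary examples); the correction terms p and q are what distinguish it from plain alternating projections.

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    return np.clip(x, lo, hi)

def project_ball(x, center, radius=1.0):
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

def dykstra(x0, proj_C, proj_D, n_iter=200):
    x = x0.astype(float)
    p = np.zeros_like(x)
    q = np.zeros_like(x)
    for _ in range(n_iter):
        y = proj_C(x + p)
        p = x + p - y          # correction term for set C
        x = proj_D(y + q)
        q = y + q - x          # correction term for set D
    return x

x = dykstra(np.array([3.0, 3.0]), project_box,
            lambda z: project_ball(z, center=np.array([0.5, 0.0])))
print(x)  # approximately the projection of (3, 3) onto the intersection
```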
@konstmish
Konstantin Mishchenko
9 months
A paper worth reading!
@maorivg
Maor Ivgi
9 months
Tired of tuning learning rates? Excited to share our #ICML23 paper: DoG, a simple parameter-free optimizer backed by solid theoretical guarantees and showing strong empirical results for over 20 diverse tasks and 8 vision & NLP models. 🐶🧵
6
83
330
0
7
51
@konstmish
Konstantin Mishchenko
3 years
I always thought that Lasso is just a trick to get a sparse solution quickly. I also thought it's worse than finding the best subset of parameters with same cardinality. Turns out Lasso's regularization might be important, and neither is always better. pdf:
1
2
48
@konstmish
Konstantin Mishchenko
10 months
Great ICML paper on communication compression: CocktailSGD combines top-K, rand-K and quantization to get the highest compression ratios. Local-update methods, such as ProxSkip, seem to fall far behind together with pure top-K compression. pdf:
1
11
47
@konstmish
Konstantin Mishchenko
3 years
I really like this paper about variance reduction for operators and minmax problems. Clean convergence proofs, and the algorithm itself has an interesting extra step that makes it slightly different from the extragradient and mirror-prox methods. arxiv:
2
9
47
@konstmish
Konstantin Mishchenko
3 months
I turned 30 today. Feeling introspective. Any suggestions on things to read or think about?
17
0
47
@konstmish
Konstantin Mishchenko
3 years
Another misconception: line search makes Newton's method work globally for convex functions. Even the books of Nesterov and of Boyd & Vandenberghe suggest using it. But there is a counterexample! I don't understand how it can have only 3 citations... Link:
@keenanisalive
Keenan Crane
3 years
Common misconception: Newton's method provides the "ideal" descent direction, and everything else is a cheap approximation. Couldn't be further from the truth! Newton is great when you're really close to a minimizer… and is a total heuristic everywhere else.
19
135
788
2
7
45
@konstmish
Konstantin Mishchenko
3 years
The discussion period for ICML has started! As an author, I'm often frustrated if the reviewers ignored my rebuttal. As a reviewer, I'm frustrated as the discussion takes too much time. So here are a few tips to minimize your effort but make authors feel you read their feedback.
1
8
44
@konstmish
Konstantin Mishchenko
10 months
Conjugate gradients are commonly used for solving linear systems, but their generalized nonlinear variants do not seem popular these days. Looks like for a good reason: in the worst case, they are not even better than gradient descent.
1
4
44
@konstmish
Konstantin Mishchenko
10 months
Gradient normalization and clipping are commonly used with SGD or Adam. The theory behind Normalized GD (NGD) is due to Shor (1985), but I particularly like this paper by @prof_grimmer where he shows that NGD works for a wide range of function classes:
2
5
40
@konstmish
Konstantin Mishchenko
1 year
I am happy to share that I will present this work at ICML 🙂
@konstmish
Konstantin Mishchenko
1 year
How to use a cheap proxy loss function (e.g. simulator or synthetic data) if you care about performance in the real world or on true data? We present a method that can provably minimize the target objective while mostly using the cheaper function! pdf: 1/4
6
49
349
2
1
41
@konstmish
Konstantin Mishchenko
4 months
@ilgyrd Here it is. They prove that gradient descent converges faster if we periodically take very large steps, contrary to the common belief that a constant 1/L stepsize is optimal (L is the Lipschitz constant of the gradient)
1
1
40
@konstmish
Konstantin Mishchenko
8 months
Accepted for presentation at NeurIPS!
@konstmish
Konstantin Mishchenko
11 months
The gradient gives you direction to a solution, but it doesn't tell you how far you should go. To remove parameter tuning from SGD, we need to estimate the distance. Today we present DoWG, a method that can do that provably for a wide class of functions:
2
26
151
0
3
40
@konstmish
Konstantin Mishchenko
10 months
Thrilled to join the Transactions on Machine Learning Research (TMLR) as an Action Editor!
2
1
38
@konstmish
Konstantin Mishchenko
8 months
There seems to be some confusion about the contributions of our D-Adaptation paper with @aaron_defazio , so we will write a blog post about the method and how it was developed. It will likely take us a few weeks, so stay tuned.
0
3
38
@konstmish
Konstantin Mishchenko
4 months
I hope to see more research directed toward practical aspects (adaptivity, nonsmoothness, learning-rate schedules), and less research on combining working ideas for the sake of writing papers. Please share the ideas that you find exciting, I must have missed quite a few! 8/8
7
2
38
@konstmish
Konstantin Mishchenko
3 months
I opened "33 Miniatures" by Matoušek to read something mathematical and pleasant, but this one made me angry. So we just want to prove that one can't tile by squares a rectangle with side lengths 1 and x, where x is irrational. It's simple, right? Geometrical proof maybe? 1/3
2
1
36
@konstmish
Konstantin Mishchenko
4 years
Gradient descent sometimes converges the same way on completely different datasets. For quadratics, it can be proved using the Marchenko-Pastur law: I'm still wondering though why it holds for SGD and neural nets like resnet-18 and densenet-121.
@gabrielpeyre
Gabriel Peyré
4 years
The Marchenko-Pastur law is the distribution limit of eigenvalues of covariance matrices when samples and size grow together.
1
63
311
0
10
36
@konstmish
Konstantin Mishchenko
3 years
Distributed optimization requires averaging the gradients computed on different nodes. This may be slower than backprop, so we can save time by using lossy compressions. Error feedback reduces the error by storing it in memory and correcting next updates.
2
9
35
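The mechanism in a few lines, as a generic sketch with top-k as the lossy compressor (not tied to any particular paper's variant): compress the gradient plus the accumulated error, and remember whatever the compressor dropped.

```python
import torch

def top_k(x, k):
    # Simple lossy compressor: keep only the k largest-magnitude entries.
    out = torch.zeros_like(x)
    idx = x.abs().topk(k).indices
    out[idx] = x[idx]
    return out

def ef_compress(grad, memory, k):
    """Error feedback: compress gradient plus accumulated error, store what was dropped."""
    corrected = grad + memory
    compressed = top_k(corrected, k)
    new_memory = corrected - compressed
    return compressed, new_memory

# Usage on one worker: send `compressed` instead of `grad`, keep `memory` locally.
memory = torch.zeros(1000)
grad = torch.randn(1000)
compressed, memory = ef_compress(grad, memory, k=50)
```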
@konstmish
Konstantin Mishchenko
3 years
SGD, SVRG (convex/non-convex), SAGA, Nesterov's acceleration, adaptive gradient, primal-dual, distributed, and tensor methods—all these algorithms can be analyzed with Lyapunov functions. Some of them do not have known proofs without Lyapunov arguments.
@ahmedallibhoy
Ahmed
3 years
I’m having an especially hard time concentrating on research this week, but instead of doomscrolling on twitter I guess I’ll do something semi-productive and make another twitter thread. Today's topic: What is a Lyapunov function? 1/
23
281
1K
2
1
35
@konstmish
Konstantin Mishchenko
4 months
In the last two decades, the literature has been drifting from classical and somewhat small problems (linear programming, least squares, SVM, etc) to deep learning. This transition has made a lot of ideas obsolete, while stochastic first-order methods rose in popularity. 1/8
1
1
35
@konstmish
Konstantin Mishchenko
4 months
Yet, the literature on convex optimization might lead us astray too. Lots of papers concerned with complexity guarantees propose impractical methods that require multiple unknown parameters. What we need is the opposite: adaptive methods that remove the need to know them. 3/8
1
0
33
@konstmish
Konstantin Mishchenko
3 years
3. My main concern is just the overall incremental nature of the work as the macaroni prototype was done by Gru (2020). The experiments are conducted in a simulated environment in a backyard, not in a proper spaceport. Also, I don’t think it can be useful for deep learning
1
0
33
@konstmish
Konstantin Mishchenko
4 months
There are, however, many new interesting developments. For instance, (L₀, L₁) setting which can explain the success of gradient clipping. Other interesting topics: heavy-tail noise, properties of normalization layers, noise injections, implicit bias, edge of stability, etc. 5/8
1
0
33
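For reference, the (L₀, L₁)-smoothness condition being referred to, under which the Hessian norm can grow with the gradient norm (which is where clipping provably helps):

```latex
\lVert \nabla^{2} f(x) \rVert \;\le\; L_0 + L_1\, \lVert \nabla f(x) \rVert .
```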
@konstmish
Konstantin Mishchenko
4 years
Today on arxiv: @gowerrobert @osebbouh and @NicLoizou develop new theory for SGD for finite-sum minimization. They give new, less restrictive bounds for its convergence, discuss optimal minibatches and touch upon Polyak stepsize (my favorite!)
0
4
32
@konstmish
Konstantin Mishchenko
10 months
I'm a bit surprised but I'm a TMLR Expert Reviewer now!
@hugo_larochelle
Hugo Larochelle
10 months
We have just finalized our first selection of TMLR Expert Reviewers. These are reviewers who have done particularly exemplary work in evaluating TMLR submissions. See the following page for details and the list of reviewers:
2
23
185
1
0
32
@konstmish
Konstantin Mishchenko
4 months
For deep learning in 2024, what benchmarks are best for optimizers? Do LoRA and full fine-tuning represent most use cases, or is pre-training still valuable too? What tasks and datasets? Please share your thoughts and code. I'd really appreciate any help in figuring this out.
4
3
30
@konstmish
Konstantin Mishchenko
2 years
The main result is as follows: if the 3rd derivatives are Lipschitz, a variant of regularized Newton method achieves O(1/k^3) convergence, which is the same as the rate of the 3rd tensor method. I.e. we can drop third derivatives and *nothing* changes. To me, this is insane. 2/5
1
0
29
@konstmish
Konstantin Mishchenko
3 years
2. The authors present a macaroni prototype but it’s just an incremental improvement of the work by Gru (2019), in which he drew the prototype. Making a macaroni prototype based on the drawing is pretty trivial, and it's still not convincing that the rocket will work
1
0
28
@konstmish
Konstantin Mishchenko
5 months
I'm at neurips for the whole week, I'm always happy to meet new people, so don't hesitate to message me if you want to chat. I'm also presenting this paper () with @ahmedkhaledv2 on Tuesday at 10:45-12:45. #NeurIPS2023
0
3
28
@konstmish
Konstantin Mishchenko
10 months
I gave a talk on adaptive methods yesterday, explaining my recent papers on estimating the stepsize in Adam from how the gradients are correlated with our updates so far. The recording is on YouTube in the quoted tweet. My slides are available here:
@shuvo_das_gupta
Shuvomoy Das Gupta
10 months
Dr. Konstantin Mishchenko ( @konstmish ) virtually visited MIT Operations Research Center yesterday and gave an exciting talk on Zoom about his work on parameter-free adaptive methods for deep learning! His talk is available at the youtube link: !
0
7
34
1
5
27