Konstantin Mishchenko

@konstmish

4,324 Followers · 563 Following · 129 Media · 511 Statuses

Research Scientist at Samsung AI Center · Outstanding Paper Award @icmlconf 2023 · Action Editor @TmlrOrg · I tweet about ML papers and math

London, UK
Joined June 2020
Pinned Tweet
@konstmish
Konstantin Mishchenko
4 months
Is optimization still a good topic to work on for a PhD? Here is my detailed response to a student who reached out asking this question.
11
84
776
@konstmish
Konstantin Mishchenko
3 years
The source of information most useful to me during my PhD was neither a book nor a lecture series nor a paper. It was this simple cheatsheet by @fpedregosa : His other blog posts are great too!
7
130
838
@konstmish
Konstantin Mishchenko
2 years
Let me share a result that surprised even Yurii Nesterov. You know how Taylor series become more accurate as you use more terms? Well, it turns out that in convex optimization, 3rd derivatives might be completely useless Let me explain. Thread 1/5
12
159
814
@konstmish
Konstantin Mishchenko
1 year
Ever wanted to understand the convergence theory of SGD? Probably best to start with this new paper () that collects all simple proofs in one place: momentum, minibatching, prox, averaging, and more. Covers nonconvex and nonsmooth cases too.
7
117
613
@konstmish
Konstantin Mishchenko
9 months
Michael I. Jordan on ChatGPT and AI: "We're fearful in 10 years it can get out of control and all that. That's just not true. ChatGPT is predicting the next word in a sentence, just remember that's what it does". 1/3
36
70
444
@konstmish
Konstantin Mishchenko
4 months
I'd like to start 2024 with some thoughts on the state of optimization theory. To begin with, let me quote Nesterov's 2003 book: "The main fact, which should be known to any person dealing with optimization models, is that, in general, the optimization problems are unsolvable" 🧵
3
75
435
@konstmish
Konstantin Mishchenko
6 months
Why do we need warm-up, cosine annealing, and other learning rate schedules when training with gradient descent? It turns out it's all about how gradient norms change over time. E.g., large norms at the start => warm-up. Slow decrease => cosine. Paper: 0/4
3
72
420
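As a concrete illustration of the schedules discussed above, here is a minimal warm-up-plus-cosine sketch; the warm-up length and peak value are arbitrary placeholders, not values from the paper.

```python
import math

def lr_schedule(step, total_steps, warmup_steps=1000, peak_lr=3e-4, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine annealing down to min_lr.
    Assumes 0 <= step <= total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```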
@konstmish
Konstantin Mishchenko
3 years
Fun fact 1 (standard): computing the matrix inverse A^(-1) has the same complexity as multiplying two matrices AB ().
Fun fact 2 (novel): solving a sparse linear system Ax=b can be cheaper than computing a matrix product. This one is insane ()
3
79
355
@konstmish
Konstantin Mishchenko
1 year
How to use a cheap proxy loss function (e.g. simulator or synthetic data) if you care about performance in the real world or on true data? We present a method that can provably minimize the target objective while mostly using the cheaper function! pdf: 1/4
6
49
349
@konstmish
Konstantin Mishchenko
10 months
This is the most mind-blowing result that I've seen in a while. Just when I thought I knew everything about gradient descent, this new theory showed me I didn't. Gradient descent needs to use unreasonably large stepsizes every now and then to converge faster. This is wild.
@prof_grimmer
Ben Grimmer
10 months
I've proven the strangest result of my career.. The classic idea that gradient descent's rate is best with constant stepsizes 1/L is wrong. The idea that we need stepsizes in (0,2/L) for convergence is wrong. Periodic long steps are better, provably.
73
609
4K
8
37
343
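A toy way to poke at this claim on a quadratic, assuming nothing beyond numpy; the occasional-long-step pattern below is an ad-hoc illustration, not the carefully constructed stepsize sequence from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((50, 50))
A = G @ G.T / 50                      # PSD matrix for the quadratic f(x) = 0.5 * x^T A x
L = np.linalg.eigvalsh(A).max()       # Lipschitz constant of the gradient A x

def run(stepsize, n_iter=500):
    x = np.ones(50)
    for k in range(n_iter):
        x = x - stepsize(k) * (A @ x)
    return 0.5 * x @ A @ x

f_const = run(lambda k: 1.0 / L)                                  # classic 1/L steps
f_long  = run(lambda k: 4.0 / L if k % 10 == 9 else 1.0 / L)      # occasional long step
print(f_const, f_long)                                            # compare final values
```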
@konstmish
Konstantin Mishchenko
3 years
Mirror descent runs gradient descent in the dual space by mapping our vector x into it. But what if instead, we map the gradient and run GD in primal space? This gives us "dual preconditioning" that generalizes various gradient transformations. @cjmaddison
4
58
338
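For reference, the two updates written side by side, with h the mirror map and h* its convex conjugate; this is my reading of the dual-preconditioning idea above, so treat the exact form as an assumption.

```latex
\text{Mirror descent:}\qquad x_{k+1} = \nabla h^{*}\!\bigl(\nabla h(x_k) - \gamma \nabla f(x_k)\bigr)
\text{Dual preconditioning:}\qquad x_{k+1} = x_k - \gamma\, \nabla h^{*}\!\bigl(\nabla f(x_k)\bigr)
```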
@konstmish
Konstantin Mishchenko
2 years
Five years ago, I started my first optimization project, which was about asynchronous gradient descent. Today, I'm happy to present our new work (with @BachFrancis , M. Even and B. Woodworth) where we finally prove: Delays do not matter. 🧵1/5
2
45
331
@konstmish
Konstantin Mishchenko
5 months
I wish every poster at every conference looked like this. The idea that all text should be the same size is just wrong.
2
26
315
@konstmish
Konstantin Mishchenko
2 years
Newton’s method is a great heuristic but it doesn’t work globally (and line search doesn’t fix it). Cubic Newton works globally but needs an expensive subroutine. Can we get something in between? I’m proud to present a method that does exactly that: [1/3]
10
46
279
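If I recall the construction correctly, the "in between" method is Newton's step with a gradient-dependent regularizer, where H is a Hessian-Lipschitz-type constant; treat this as a sketch of the idea rather than the paper's exact statement.

```latex
x_{k+1} = x_k - \Bigl(\nabla^2 f(x_k) + \sqrt{H\,\lVert \nabla f(x_k)\rVert}\; I\Bigr)^{-1} \nabla f(x_k)
```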
@konstmish
Konstantin Mishchenko
11 months
Take your favorite optimizer (Adam, SGD, Lion) and feed it into Mechanic to get the learning rate. You give it the direction, it gives you the magnitude. The new optimizer has been tested on a wide range of deep learning problems: ViT, LSTM, ResNet, etc.
6
49
264
@konstmish
Konstantin Mishchenko
3 years
Relative smoothness is one of the most reinvented definitions in optimization. It gives a simple condition for convergence of mirror descent and dates back at least to 2010. Its many applications should be the way mirror descent is motivated in lectures.
3
46
258
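For completeness, the definition in question: f is L-smooth relative to a reference function h if the upper bound below holds, with D_h the Bregman divergence of h.

```latex
f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + L\, D_h(y, x),
\qquad D_h(y, x) = h(y) - h(x) - \langle \nabla h(x),\, y - x\rangle .
```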
@konstmish
Konstantin Mishchenko
11 months
Heavy-ball momentum is different from Nesterov's: its main benefit is 𝘯𝘰𝘵 acceleration, but the fact that the 𝘭𝘢𝘴𝘵 iterate, rather than average, converges under stochastic gradients. It's best explained in this paper that almost no one knows about:
6
44
232
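To make the contrast concrete, the two momentum updates in their standard textbook forms:

```latex
\text{Heavy ball:}\quad x_{k+1} = x_k - \gamma \nabla f(x_k) + \beta\,(x_k - x_{k-1})
\text{Nesterov:}\quad y_k = x_k + \beta\,(x_k - x_{k-1}), \qquad x_{k+1} = y_k - \gamma \nabla f(y_k)
```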
@konstmish
Konstantin Mishchenko
1 year
Do you like when an optimizer works with stepsize=1? @aaron_defazio and I just released a new stepsize estimation trick called D-Adaptation that does that for Adagrad and Adam. Paper pdf: Code: 🧵Thread 1/7
6
55
238
@konstmish
Konstantin Mishchenko
11 months
We just released a new optimizer Prodigy. Based on D-Adaptation, Prodigy has stronger theoretical guarantees and we derive how to estimate the stepsize for Adam. We tested it on a wide range of deep nets, so I can confidently say: it does work in practice.
3
54
211
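A minimal usage sketch, assuming the package is installed as `prodigyopt` and exposes a `Prodigy` optimizer class as in the repository's README; check the repo for the actual interface and recommended settings.

```python
import torch
from prodigyopt import Prodigy  # assumed import path; see the Prodigy repository

model = torch.nn.Linear(10, 1)
# lr acts as a multiplier on the internally estimated stepsize, so lr=1.0 is the usual choice.
opt = Prodigy(model.parameters(), lr=1.0)

for _ in range(100):
    loss = model(torch.randn(32, 10)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```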
@konstmish
Konstantin Mishchenko
11 months
A simple yet elegant observation: in SGD, if we apply the sign to the stochastic gradient, it breaks convergence: the signSGD method might oscillate forever. This new paper proposes to inject noise before applying the sign and proves that this fixes the method:
4
28
209
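The fix described above, written out schematically (g_k is the stochastic gradient and ξ_k the injected noise); this is a paraphrase of the update, not a quote from the paper.

```latex
x_{k+1} = x_k - \gamma_k\, \mathrm{sign}\bigl(g_k + \xi_k\bigr)
```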
@konstmish
Konstantin Mishchenko
1 year
An announcement I've been meaning to make for some time, but I wanted to wait until I have a visa. In January 2023, I will be joining Samsung AI center in Cambridge, UK as a Research Scientist! I will be part of Distributed AI group and will keep working on efficient ML and FL.
9
4
211
@konstmish
Konstantin Mishchenko
3 years
Almost no one knows about this NIPS 1993 paper by O. Mangasarian and his PhD student M. Solodov. Yet, today their algorithm is known as Local-SGD and it's a huge success in parallel/federated learning. It'd be a golden oldie if it wasn't ahead of its time.
0
35
202
@konstmish
Konstantin Mishchenko
11 months
SGD in practice usually doesn't sample data uniformly and instead goes over the dataset in epochs, which is called Random Reshuffling. We've known for some time that RR is better than SGD for convex functions and now it's been proven for nonconvex:
5
29
197
@konstmish
Konstantin Mishchenko
3 years
What happens to Adam if we turn off momentum and epsilon? If we set β₁=0, we get RMSprop. If we set β₁=β₂=0, we get signSGD. How well does signSGD with constant batch size converge? It doesn't. Not even with tiny stepsizes and overparameterization.
4
29
188
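For reference, the reductions spelled out from the standard Adam update (bias correction omitted for brevity):

```latex
m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k, \qquad
v_k = \beta_2 v_{k-1} + (1-\beta_2)\, g_k^2, \qquad
x_{k+1} = x_k - \gamma\, \frac{m_k}{\sqrt{v_k} + \varepsilon}.
% beta_1 = 0:           x_{k+1} = x_k - gamma * g_k / (sqrt(v_k) + eps)   -> RMSprop
% beta_1 = beta_2 = 0:  x_{k+1} = x_k - gamma * g_k / (|g_k| + eps)       -> signSGD as eps -> 0
```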
@konstmish
Konstantin Mishchenko
3 years
Yurii Nesterov started one of his keynote talks by imagining that someone in the audience would say "Oh my God! Again new optimization methods. We hear about that for a decade, why do you need to generate so many schemes and all things like that." Actually, a good question...
4
19
185
@konstmish
Konstantin Mishchenko
10 months
I keep forgetting that backprop doesn't give us (sub)gradients. It can even return a nonzero value on a function that is 0 everywhere, so one needs to be careful with the assumptions on what backprop can do. A good read on the topic:
7
28
168
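A concrete instance of the phenomenon (a classic example from the autodiff literature, here in PyTorch): the function below is identically zero, yet backprop returns a nonzero "gradient" at 0 because it composes the subgradient choices of the individual ReLUs.

```python
import torch

def zero_fn(x):
    # relu(x) - relu(-x) equals x for every real x, so this function is identically 0.
    return torch.relu(x) - torch.relu(-x) - x

x = torch.tensor(0.0, requires_grad=True)
zero_fn(x).backward()
print(x.grad)  # tensor(-1.), not 0: autodiff output need not be a (sub)gradient
```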
@konstmish
Konstantin Mishchenko
4 years
#ICML2020 Our work on Adaptive Gradient Descent to be presented in 30 mins (and one more time in 13 hours). A parameter-free replacement of line search. Provably estimates local curvature. Works out-of-the-box on almost any convex problem. Please retweet!
3
41
165
@konstmish
Konstantin Mishchenko
4 years
Online conferences suck and have almost zero interaction. AISTATS this year had videos with chat sessions, I got 1 question in total from the sessions. But at least papers that won the best paper award should have a lot of interaction? See the picture for the answer.
7
19
157
@konstmish
Konstantin Mishchenko
3 years
"Gradient descent with exact line search may not converge" and other recent (IMO, counterintuitive) counterexamples in convex optimization. My second favorite is "Newton’s flow may not converge" arxiv:
0
30
154
@konstmish
Konstantin Mishchenko
3 months
I often hear people who know optimization theory say "averaging works assuming convexity, so it doesn't make sense for deep learning." Yes it does, it just works. Often, it can replace learning rate scheduler. Just don't use uniform weights, use EMA or polynomial weights.
@rasbt
Sebastian Raschka
3 months
Weight averaging and model merging for LLMs seem to be the most interesting themes in 2024 so far. What are the benefits? Combining multiple models (or checkpoints) into a single one can improve training convergence, overall performance, and also robustness. I will probably do…
21
165
1K
7
14
152
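A minimal sketch of the EMA variant in PyTorch (the decay value is a common placeholder, and in practice buffers such as batch-norm statistics need the same treatment); train the raw model and evaluate with the averaged copy.

```python
import copy
import torch

def make_ema(model):
    ema = copy.deepcopy(model)
    for p in ema.parameters():
        p.requires_grad_(False)
    return ema

@torch.no_grad()
def update_ema(ema, model, decay=0.999):
    # ema <- decay * ema + (1 - decay) * current weights
    for p_ema, p in zip(ema.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)
```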
@konstmish
Konstantin Mishchenko
11 months
The gradient gives you direction to a solution, but it doesn't tell you how far you should go. To remove parameter tuning from SGD, we need to estimate the distance. Today we present DoWG, a method that can do that provably for a wide class of functions:
2
26
151
@konstmish
Konstantin Mishchenko
3 years
SGD in theory: sample uniformly from all data.
SGD in practice (aka Random Reshuffling): go over the data in random order.
If you have `for data in trainloader` in your code, you're using it. If you want to know why it's faster, check our #neurips paper!
3
17
145
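The two sampling schemes spelled out at the index level, as a toy sketch; a dataloader with shuffle=True implements the second one.

```python
import random

n, n_epochs = 1000, 10
dataset = list(range(n))

# SGD as usually analyzed: sample an index uniformly at random at every step.
for _ in range(n * n_epochs):
    i = random.randrange(n)
    # ... gradient step on dataset[i]

# Random Reshuffling: shuffle once per epoch and sweep through every example.
for _ in range(n_epochs):
    random.shuffle(dataset)
    for example in dataset:   # this is what `for data in trainloader` does
        pass                  # ... gradient step on `example`
```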
@konstmish
Konstantin Mishchenko
3 years
I recall the following problem that I was asked by a hedge fund during my interview. For any n angles a_1, ..., a_n from [0, 2pi], prove that the matrix A whose entries are A_{ij}=cos(a_i-a_j) is positive semi-definite. Don't know how it helps with trading but the problem is nice
@gabrielpeyre
Gabriel Peyré
3 years
Positive semi-definite matrices come in many flavors!
3
108
638
5
18
138
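For the record, the short solution: the angle-difference formula turns A into a sum of two rank-one matrices, which is enough to conclude.

```latex
A_{ij} = \cos(a_i - a_j) = \cos a_i \cos a_j + \sin a_i \sin a_j
\;\Longrightarrow\;
A = u u^{\top} + v v^{\top}, \qquad u_i = \cos a_i,\; v_i = \sin a_i,
\qquad z^{\top} A z = (u^{\top} z)^2 + (v^{\top} z)^2 \ge 0 .
```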
@konstmish
Konstantin Mishchenko
2 years
Interesting new paper on a subclass of convex functions that are smooth just at the optima. A big surprise is that Heavy-Ball Momentum turns out to be optimal despite being suboptimal in a smaller class of convex functions that are smooth everywhere.
0
29
134
@konstmish
Konstantin Mishchenko
9 months
This paper on training RNNs that proposed using clipping to tackle exploding gradients is one of those papers that don't get old. I really enjoyed reading it, especially since clipping has become a vital component of training transformers and other models.
1
20
134
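For reference, the standard way clipping is applied in a PyTorch training step (the max_norm value is just a common default, not one from the paper):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(32, 10)).pow(2).mean()
opt.zero_grad()
loss.backward()
# Rescale the full gradient so its global norm is at most 1.0 (tames exploding gradients).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```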
@konstmish
Konstantin Mishchenko
3 years
Very interesting news, @SchmidhuberAI is joining KAUST as a Director of the AI Initiative. I think this is a great match, he has a lot of experience in the area and seems to have big ambitions, while KAUST provides a great environment for building and growing research labs.
2
11
134
@konstmish
Konstantin Mishchenko
3 years
In minmax optimization, there is no direct analog of Nesterov's acceleration. However, this year, some form of acceleration was proposed to make gradient norms small faster. It is a bit similar to the extragradient but uses a very unusual momentum. arxiv:
1
17
131
@konstmish
Konstantin Mishchenko
2 years
This made my day, I really hope that TMLR will become more popular than NeurIPS, ICML, and other ML conferences:
- no deadline => higher submission quality
- reviews as quick as at conferences
- no need to review 7 papers at once
@hugo_larochelle
Hugo Larochelle
2 years
Today, @RaiaHadsell , @kchonyc and I are happy to announce the creation of a new journal: Transaction on Machine Learning Research (TMLR) Learn more in our post:
54
693
3K
3
7
125
@konstmish
Konstantin Mishchenko
9 months
"Almost every piece of journalism on AI I see is not very good, that's the shocking thing. They are all going for this danger that it's going to surpass humans. A computer already surpasses me in ability to calculate integrals, it surpasses me in tons of ways, I don't care." 3/3
3
7
114
@konstmish
Konstantin Mishchenko
10 months
I was wondering what Nesterov works on these days, seems like he's currently interested in minimizing 4th-order polynomials. He also uses these methods as subsolvers for nonconvex problems with 4th-order components, such as phase retrieval. pdf:
2
14
115
@konstmish
Konstantin Mishchenko
3 years
The difficulty of analyzing quasi-Newton methods comes from the fact that we try to "learn" the Hessian. Rodomanov and Nesterov made a breakthrough last year quantifying this and showing rates of convergence. This new paper proposes an even simpler theory!
3
15
109
@konstmish
Konstantin Mishchenko
9 months
Deep learning: is it more about non-convexity or non-smoothness? I firmly believe it's non-smoothness. Adagrad and other tricks for non-smooth problems had a lot of impact. In contrast, methods for smooth non-convex problems, especially variance reduction, do not seem as useful.
7
2
104
@konstmish
Konstantin Mishchenko
11 months
Gradient descent used to be studied under "quadratic-like" assumptions, with quadratic upper and lower bounds. It's nice to see more and more papers break this tradition and show that GD and other invented tricks do not need these assumptions, like this paper:
1
15
99
@konstmish
Konstantin Mishchenko
3 years
Everyone knows that the dual norm to l_inf norm is l_1 norm. What happens if we define k-norm as the sum of the k largest components (k=1 corresponds to l_inf)? It turns out the answer is very simple and combines l_1 and l_inf in an interesting way:
4
9
98
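If I remember the result correctly (treat it as an assumption), the dual of the k-norm is a maximum of the two familiar norms; the endpoints k=1 and k=n recover the l_1 and l_inf duals.

```latex
\lVert x\rVert_{(k)} = \sum_{i=1}^{k} \bigl|x_{[i]}\bigr| \;\;(\text{the $k$ largest magnitudes})
\qquad\Longrightarrow\qquad
\lVert y\rVert_{(k)}^{*} = \max\Bigl\{\, \lVert y\rVert_{\infty},\; \tfrac{1}{k}\lVert y\rVert_{1} \Bigr\}.
```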
@konstmish
Konstantin Mishchenko
7 months
Constrained optimization perspective on what Lion optimizer is doing. They also generalize Lion to operations other than sign in the update. Paper: It seems highly related to dual space preconditioning, which is somehow not cited:
0
16
95
@konstmish
Konstantin Mishchenko
2 years
Surprise, reviews are random!
- 25/44 papers chosen for spotlight were rejected by the second group.
- But at least if all reviewers vote "7", it's a solid "7"? Nope, still random, it could be "4, 5, 4"!
So is a NeurIPS paper more credible than an arXiv submission? I'm not so sure.
@rasbt
Sebastian Raschka
2 years
"NeurIPS 2021 finally accepted submissions statistics" from (summarized via ). Wow, it looks like 199 out of the 298 papers (that is, 2/3!!!) accepted by group 1 would have been rejected by group 2.
9
97
396
2
21
91
@konstmish
Konstantin Mishchenko
2 years
If the credit is shared, so should be responsibility for plagiarism. A senior author can't walk away saying "I didn't actively participate" because then they shouldn't have been an author in the first place. There even was arxiv comment about plagiarism, all authors could see it.
@giffmana
Lucas Beyer (bl16)
2 years
Roadmap to big plagiarism🤦‍♂️ Let me guess the reaction this will get "we identified that ALL the instances were written by ONE of the authors, and we have taken action against him/her." And that scapegoat will be the most junior of them all. I hope I'm wrong though!
17
7
153
4
6
89
@konstmish
Konstantin Mishchenko
1 month
Finished responding to all ICML rebuttals for papers where I am a reviewer, increased the score for one of the papers and explained why I don't agree with the responses for the other papers. Please do that if you're an ICML reviewer too, it doesn't take that much time.
4
4
89
@konstmish
Konstantin Mishchenko
10 months
TensorFlow and Flax set the default momentum in batchnorm to 0.99, while PyTorch uses 0.9. This parameter (β) is used to estimate the running mean as E[x] = β*E[x] + (1-β)*mean(batch), so it appears very important. Did anyone study the optimal choice? I can't find anything.
6
11
88
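One subtlety worth adding: the two frameworks parameterize the same update with opposite conventions, so their defaults are closer than the raw numbers suggest. A small sketch (β refers to the coefficient on the running estimate, as in the formula above):

```python
import torch.nn as nn

# PyTorch's `momentum` multiplies the *new* batch statistic:
#   running = (1 - momentum) * running + momentum * batch_mean
# so its default momentum=0.1 corresponds to beta = 0.9 in the formula above.
bn_torch = nn.BatchNorm2d(64, momentum=0.1)

# TF/Keras/Flax use `momentum` as the coefficient on the *running* estimate:
#   running = momentum * running + (1 - momentum) * batch_mean
# so their default momentum=0.99 corresponds to beta = 0.99.
```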
@konstmish
Konstantin Mishchenko
11 months
Overparameterization empowers deep learning, but in federated learning the data is heterogeneous across clients, which makes it very difficult. We just proved that the solution to this is 𝑜𝑣𝑒𝑟𝑝𝑒𝑟𝑠𝑜𝑛𝑎𝑙𝑖𝑧𝑎𝑡𝑖𝑜𝑛, and it's always achievable:
0
22
85
@konstmish
Konstantin Mishchenko
2 years
I was lucky to benefit from the former friendship between Russia and Ukraine. I have many Ukrainian friends. Back in 2011, I spent 1 month studying math at a summer camp near Kyiv. I'm Russian and I'm against the terrible war started by Russia. It is so cruel and so unnecessary.
1
7
85
@konstmish
Konstantin Mishchenko
2 years
@karpathy I want to point out that mathematically it matters if you're going over the dataset in random order (epochs) or if you sample random points from the whole dataset (iterations). The former is Random Reshuffling, the latter is SGD; they have different convergence rates.
3
4
83
@konstmish
Konstantin Mishchenko
3 years
Implicit meta-learning (iMAML) uses implicit gradient updates that remove the need to differentiate through an optimization loop. The objective it minimizes is the sum of Moreau envelopes, so iMAML converges to initialization close to task-loss minimizers.
2
19
82
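For context, the Moreau envelope and the resulting meta-objective (schematic notation, following the interpretation above): the learned initialization stays close, in the proximal sense, to each task's minimizer.

```latex
M_{\lambda} f_i(x) = \min_{z}\; f_i(z) + \tfrac{1}{2\lambda}\,\lVert z - x\rVert^{2},
\qquad
\min_{x}\; \frac{1}{n}\sum_{i=1}^{n} M_{\lambda} f_i(x).
```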
@konstmish
Konstantin Mishchenko
3 months
Nothing makes me so happy as seeing some of my work be used by people. Today I noticed D-Adaptation PyTorch implementation has been installed >1M times. Prodigy is getting close to 0.4M. If you want to try it: D-Adaptation: Prodigy:
1
9
83
@konstmish
Konstantin Mishchenko
2 years
I'm thrilled to share that I've been selected as a Rising Star in Data Science by the University of Chicago for my work on optimization and federated learning. It is very encouraging and I'm honored to be part of a great cohort! Official website:
6
2
83
@konstmish
Konstantin Mishchenko
3 years
The ratings are not final, but I decided I'd mention 10 optimization papers with top scores anyway 💪 Thread! 0/10
@SergeyI49013776
Sergey Ivanov
3 years
ICLR 2021 reviews are public and here is a sorted list for all papers with ratings. What's interesting is that none of the papers scored only 10s. The top-1 is a "graph" paper with 8.25. #iclr #iclr21 #iclr2021
0
72
313
1
13
82
@konstmish
Konstantin Mishchenko
2 months
Had the pleasure of presenting my work on adaptive optimizers at the CISS conference () at Princeton. It was a great event with talks by @KairouzPeter , @srush_nlp , @jasondeanlee and others, I’m very grateful to @chijinML for inviting me!
1
7
81
@konstmish
Konstantin Mishchenko
4 years
One of the most entertaining linear algebra books: "33 miniatures on linear algebra" by Matoušek. I enjoyed all of them but #12 even shocked me: one can prove the impossibility of tiling a rectangle with irrational side ratio by a finite # of squares using the Hahn–Banach theorem for R over Q.
1
17
80
@konstmish
Konstantin Mishchenko
1 year
Our paper got accepted as an oral for ICML but my US visa application got rejected, so I'd like to present it where I can. Are there any ML research seminars in London or around where I can give a talk? I can also cover more recent (and very exciting) parameter-free methods!
@konstmish
Konstantin Mishchenko
1 year
Do you like when an optimizer works with stepsize=1? @aaron_defazio and I just released a new stepsize estimation trick called D-Adaptation that does that for Adagrad and Adam. Paper pdf: Code: 🧵Thread 1/7
6
55
238
3
10
79
@konstmish
Konstantin Mishchenko
4 years
Few people know but the original paper of Nesterov written in 1984 is just 5 pages and very easy to follow. Convergence statement and the proof fit 1 page. Some call it "mysterious algebraic manipulations", but I disagree–it's simply a Lyapunov analysis.
@gabrielpeyre
Gabriel Peyré
4 years
Oldies but goldies: Yurii Nesterov, A method for solving a convex programming problem with convergence rate O(1/k^2). Improves the 1/k rate of vanilla gradient descent to 1/k^2. The idea had a profound impact on mathematical programming.
2
81
330
4
7
78
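For reference, the method in one of its commonly used equivalent forms (not a quote from the original paper):

```latex
y_k = x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1}),
\qquad
x_{k+1} = y_k - \frac{1}{L}\,\nabla f(y_k),
\qquad
f(x_k) - f^{\star} = O\!\left(\frac{1}{k^2}\right).
```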
@konstmish
Konstantin Mishchenko
10 months
The book "The Man Who Solved the Market" about Jim Simons is a must-read for people who love math. My favorite quote, attributed to an accomplished code-breaker Lenny Baum, is “Bad ideas is good, good ideas is terrific, no ideas is terrible.” A great motto for doing research.
1
9
77
@konstmish
Konstantin Mishchenko
1 month
I often mention in my talks that we should worry less about nonconvexity and more about non-smoothness. We can get surprisingly far without worrying about all those saddle points, but ignoring non-smoothness quickly leads to methods that don't work in practice.
@aaron_defazio
Aaron Defazio
1 month
I find the non-smooth convex analysis framework seems to better describe the behavior of neural network training than any nonconvex analysis framework.
6
5
99
5
5
69
@konstmish
Konstantin Mishchenko
4 months
I've never felt so GPU-poor as right now.
1
4
62
@konstmish
Konstantin Mishchenko
2 years
This week we presented our ICLR 2022 Spotlight work "IntSGD: Adaptive Floatless Compression of Stochastic Gradients". Key idea: adaptively compress float32->int8 on multiple GPUs paper: slides: poster:
1
16
63
@konstmish
Konstantin Mishchenko
3 years
Let me try to mimic ICML reviewers for these presentations. A thread with 3 reviews. 1. The drawing has some novelty but the overall quality is disappointing. Why the rocket doesn't have wings? All birds have it. Also, without experiments, the drawing is not too interesting.
@peter_richtarik
Peter Richtarik
3 years
#ICML2021 reviews: (thanks to @mher_safaryan for the link)
0
4
33
1
10
60
@konstmish
Konstantin Mishchenko
2 years
Happy to have 2 papers accepted at #ICML2022 on communication-efficient SGD: 1. Multiple gradient steps before aggregation give provable acceleration: 2. Random Reshuffling (SGD without resampling) with full passes over local data:
9
8
58
@konstmish
Konstantin Mishchenko
9 months
He strongly advocates for a balanced view. Not too exuberant, not too fearful. "Innovation is altogether inevitable and positive." but also "If you were an exuberant AI bro, which there are many down in Silicon Valley, I would have told you to chill and shut up." 2/3
2
4
55
@konstmish
Konstantin Mishchenko
2 years
The recording of my talk about globally convergent Newton method is available on youtube: I was lucky to have many experts on second-order methods in the audience who commented on the results, I hope it’d make the video more fun!
@konstmish
Konstantin Mishchenko
2 years
Newton’s method is a great heuristic but it doesn’t work globally (and line search doesn’t fix it). Cubic Newton works globally but needs an expensive subroutine. Can we get something in between? I’m proud to present a method that does exactly that: [1/3]
10
46
279
3
11
55
@konstmish
Konstantin Mishchenko
4 years
Dykstra's algorithm works for any number of sets n. If the sets are linear, you recover Kaczmarz algorithm. If you replace projection onto j-th set with prox of some function g_j, you get SDCA. If you include a gradient descent step, you get SDM:
@gabrielpeyre
Gabriel Peyré
4 years
Oldies but goldies: J. P. Boyle and R. L. Dykstra, A method for finding projections onto the intersection of convex sets in Hilbert spaces, 1986. Dykstra’s algorithm is the simplest method to project on the intersection of two convex sets.
2
49
192
0
8
56
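A minimal sketch of Dykstra's algorithm for two convex sets, given their projection operators (the box and ball here are arbitrary examples); the correction terms p and q are what distinguish it from plain alternating projections.

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    return np.clip(x, lo, hi)

def project_ball(x, center, radius=1.0):
    d = x - center
    n = np.linalg.norm(d)
    return x if n <= radius else center + radius * d / n

def dykstra(x0, proj_C, proj_D, n_iter=200):
    x = x0.astype(float)
    p = np.zeros_like(x)
    q = np.zeros_like(x)
    for _ in range(n_iter):
        y = proj_C(x + p)
        p = x + p - y          # correction term for set C
        x = proj_D(y + q)
        q = y + q - x          # correction term for set D
    return x

x = dykstra(np.array([3.0, 3.0]), project_box,
            lambda z: project_ball(z, center=np.array([0.5, 0.0])))
print(x)  # approximately the projection of (3, 3) onto the intersection
```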
@konstmish
Konstantin Mishchenko
9 months
A paper worth reading!
@maorivg
Maor Ivgi
9 months
Tired of tuning learning rates? Excited to share our #ICML23 paper: DoG, a simple parameter-free optimizer backed by solid theoretical guarantees and showing strong empirical results for over 20 diverse tasks and 8 vision & NLP models. 🐶🧵
6
83
330
0
7
51
@konstmish
Konstantin Mishchenko
3 years
I always thought that Lasso is just a trick to get a sparse solution quickly. I also thought it's worse than finding the best subset of parameters with same cardinality. Turns out Lasso's regularization might be important, and neither is always better. pdf:
1
2
48
@konstmish
Konstantin Mishchenko
10 months
Great ICML paper on communication compression: CocktailSGD combines top-K, rand-K and quantization to get the highest compression ratios. Local-update methods, such as ProxSkip, seem to fall far behind together with pure top-K compression. pdf:
1
11
47
@konstmish
Konstantin Mishchenko
3 years
I really like this paper about variance reduction for operators and minmax problems. Clean convergence proofs, and the algorithm itself has an interesting extra step that makes it slightly different from the extragradient and mirror-prox methods. arxiv:
2
9
47
@konstmish
Konstantin Mishchenko
3 months
I turned 30 today. Feeling introspective. Any suggestions on things to read or think about?
17
0
47
@konstmish
Konstantin Mishchenko
3 years
Another misconception: line search makes Newton's method work globally for convex functions. Even the books of Nesterov and of Boyd & Vandenberghe suggest using it. But there is a counterexample! I don't understand how it can have only 3 citations... Link:
@keenanisalive
Keenan Crane
3 years
Common misconception: Newton's method provides the "ideal" descent direction, and everything else is a cheap approximation. Couldn't be further from the truth! Newton is great when you're really close to a minimizer… and is a total heuristic everywhere else.
19
135
788
2
7
45
@konstmish
Konstantin Mishchenko
3 years
The discussion period for ICML has started! As an author, I'm often frustrated if the reviewers ignored my rebuttal. As a reviewer, I'm frustrated as the discussion takes too much time. So here are a few tips to minimize your effort but make authors feel you read their feedback.
1
8
44
@konstmish
Konstantin Mishchenko
10 months
Conjugate gradients are commonly used for solving linear systems, but their generalized nonlinear variants do not seem popular these days. Looks like for a good reason: in the worst case, they are not even better than gradient descent.
1
4
44
@konstmish
Konstantin Mishchenko
10 months
Gradient normalization and clipping are commonly used with SGD or Adam. The theory behind Normalized GD (NGD) is due to Shor (1985), but I particularly like this paper by @prof_grimmer where he shows that NGD works for a wide range of function classes:
2
5
40
@konstmish
Konstantin Mishchenko
1 year
I am happy to share that I will present this work at ICML 🙂
@konstmish
Konstantin Mishchenko
1 year
How to use a cheap proxy loss function (e.g. simulator or synthetic data) if you care about performance in the real world or on true data? We present a method that can provably minimize the target objective while mostly using the cheaper function! pdf: 1/4
6
49
349
2
1
41
@konstmish
Konstantin Mishchenko
4 months
@ilgyrd Here it is. They prove that gradient descent converges faster if we periodically take very large steps, contrary to the common belief that a constant 1/L stepsize is optimal (L is the Lipschitz constant of the gradient)
1
1
40
@konstmish
Konstantin Mishchenko
8 months
Accepted for presentation at NeurIPS!
@konstmish
Konstantin Mishchenko
11 months
The gradient gives you direction to a solution, but it doesn't tell you how far you should go. To remove parameter tuning from SGD, we need to estimate the distance. Today we present DoWG, a method that can do that provably for a wide class of functions:
2
26
151
0
3
40
@konstmish
Konstantin Mishchenko
10 months
Thrilled to join the Transactions on Machine Learning Research (TMLR) as an Action Editor!
2
1
38
@konstmish
Konstantin Mishchenko
8 months
There seems to be some confusion about the contributions of our D-Adaptation paper with @aaron_defazio , so we will write a blog post about the method and how it was developed. It will likely take us a few weeks, so stay tuned.
0
3
38
@konstmish
Konstantin Mishchenko
4 months
I hope to see more research directed toward practical aspects (adaptivity, nonsmoothness, learning-rate schedules), and less research on combining working ideas for the sake of writing papers. Please share the ideas that you find exciting, I must have missed quite a few! 8/8
7
2
38
@konstmish
Konstantin Mishchenko
3 months
I opened "33 Miniatures" by Matoušek to read something mathematical and pleasant, but this one made me angry. So we just want to prove that one can't tile by squares a rectangle with side lengths 1 and x, where x is irrational. It's simple, right? Geometrical proof maybe? 1/3
2
1
36
@konstmish
Konstantin Mishchenko
4 years
Gradient descent sometimes converges the same way on completely different datasets. For quadratics, it can be proved using the Marchenko-Pastur law: I'm still wondering though why it holds for SGD and neural nets like resnet-18 and densenet-121.
@gabrielpeyre
Gabriel Peyré
4 years
The Marchenko-Pastur law is the distribution limit of eigenvalues of covariance matrices when samples and size grow together.
1
63
311
0
10
36
@konstmish
Konstantin Mishchenko
3 years
Distributed optimization requires averaging the gradients computed on different nodes. This may be slower than backprop, so we can save time by using lossy compressions. Error feedback reduces the error by storing it in memory and correcting next updates.
2
9
35
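The mechanism in a few lines, as a generic sketch with top-k as the lossy compressor (not tied to any particular paper's variant): compress the gradient plus the accumulated error, and remember whatever the compressor dropped.

```python
import torch

def top_k(x, k):
    # Simple lossy compressor: keep only the k largest-magnitude entries.
    out = torch.zeros_like(x)
    idx = x.abs().topk(k).indices
    out[idx] = x[idx]
    return out

def ef_compress(grad, memory, k):
    """Error feedback: compress gradient plus accumulated error, store what was dropped."""
    corrected = grad + memory
    compressed = top_k(corrected, k)
    new_memory = corrected - compressed
    return compressed, new_memory

# Usage on one worker: send `compressed` instead of `grad`, keep `memory` locally.
memory = torch.zeros(1000)
grad = torch.randn(1000)
compressed, memory = ef_compress(grad, memory, k=50)
```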
@konstmish
Konstantin Mishchenko
3 years
SGD, SVRG (convex/non-convex), SAGA, Nesterov's acceleration, adaptive gradient, primal-dual, distributed, and tensor methods—all these algorithms can be analyzed with Lyapunov functions. Some of them do not have known proofs without Lyapunov arguments.
@ahmedallibhoy
Ahmed
3 years
I’m having an especially hard time concentrating on research this week, but instead of doomscrolling on twitter I guess I’ll do something semi-productive and make another twitter thread. Today's topic: What is a Lyapunov function? 1/
23
281
1K
2
1
35
@konstmish
Konstantin Mishchenko
4 months
In the last two decades, the literature has been drifting from classical and somewhat small problems (linear programming, least squares, SVM, etc) to deep learning. This transition has made a lot of ideas obsolete, while stochastic first-order methods rose in popularity. 1/8
1
1
35
@konstmish
Konstantin Mishchenko
4 months
Yet, the literature on convex optimization might lead us astray too. Lots of papers concerned with complexity guarantees propose impractical methods that require multiple unknown parameters. What we need is the opposite: adaptive methods that remove the need to know them. 3/8
1
0
33
@konstmish
Konstantin Mishchenko
3 years
3. My main concern is just the overall incremental nature of the work as the macaroni prototype was done by Gru (2020). The experiments are conducted in a simulated environment in a backyard, not in a proper spaceport. Also, I don’t think it can be useful for deep learning
1
0
33
@konstmish
Konstantin Mishchenko
4 months
There are, however, many new interesting developments. For instance, (L₀, L₁) setting which can explain the success of gradient clipping. Other interesting topics: heavy-tail noise, properties of normalization layers, noise injections, implicit bias, edge of stability, etc. 5/8
1
0
33
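For reference, the (L₀, L₁)-smoothness condition being referred to, under which the Hessian norm can grow with the gradient norm (which is where clipping provably helps):

```latex
\lVert \nabla^{2} f(x) \rVert \;\le\; L_0 + L_1\, \lVert \nabla f(x) \rVert .
```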
@konstmish
Konstantin Mishchenko
4 years
Today on arxiv: @gowerrobert @osebbouh and @NicLoizou develop new theory for SGD for finite-sum minimization. They give new, less restrictive bounds for its convergence, discuss optimal minibatches and touch upon Polyak stepsize (my favorite!)
0
4
32
@konstmish
Konstantin Mishchenko
10 months
I'm a bit surprised but I'm a TMLR Expert Reviewer now!
@hugo_larochelle
Hugo Larochelle
10 months
We have just finalized our first selection of TMLR Expert Reviewers. These are reviewers who have done particularly exemplary work in evaluating TMLR submissions. See the following page for details and the list of reviewers:
2
23
185
1
0
32
@konstmish
Konstantin Mishchenko
4 months
For deep learning in 2024, what benchmarks are best for optimizers? Do LoRA and full fine-tuning represent most use cases, or is pre-training still valuable too? What tasks and datasets? Please share your thoughts and code. I'd really appreciate any help in figuring this out.
4
3
30
@konstmish
Konstantin Mishchenko
2 years
The main result is as follows: if the 3rd derivatives are Lipschitz, a variant of regularized Newton method achieves O(1/k^3) convergence, which is the same as the rate of the 3rd tensor method. I.e. we can drop third derivatives and *nothing* changes. To me, this is insane. 2/5
1
0
29
@konstmish
Konstantin Mishchenko
3 years
2. The authors present a macaroni prototype but it’s just an incremental improvement of the work by Gru (2019), in which he drew the prototype. Making a macaroni prototype based on the drawing is pretty trivial, and it's still not convincing that the rocket will work
1
0
28
@konstmish
Konstantin Mishchenko
5 months
I'm at neurips for the whole week, I'm always happy to meet new people, so don't hesitate to message me if you want to chat. I'm also presenting this paper () with @ahmedkhaledv2 on Tuesday at 10:45-12:45. #NeurIPS2023
0
3
28
@konstmish
Konstantin Mishchenko
10 months
I gave a talk on adaptive methods yesterday, explaining my recent papers on estimating the stepsize in Adam from how the gradients are correlated with our updates so far. The recording is on YouTube in the quoted tweet. My slides are available here:
@shuvo_das_gupta
Shuvomoy Das Gupta
10 months
Dr. Konstantin Mishchenko ( @konstmish ) virtually visited MIT Operations Research Center yesterday and gave an exciting talk on Zoom about his work on parameter-free adaptive methods for deep learning! His talk is available at the youtube link: !
0
7
34
1
5
27