Konstantin Mishchenko
4 months
There are, however, many interesting new developments. One example is the (L₀, L₁)-smoothness setting, which can explain the success of gradient clipping. Other interesting topics: heavy-tailed noise, properties of normalization layers, noise injection, implicit bias, edge of stability, etc.
5/8