This paper looks like a big step forward for the Transformer architecture!
A foundational improvement, not as shiny as other things, but a really big step forward nonetheless
Meta researchers just dropped PyTorch Distributed Shampoo 🧴 a few days ago: 💥
Train neural networks with a second order method for better performance.
The underlying work it is based on has been a passion project for the last 5 years while swimming…
It’s been a privilege to work alongside our Gemini leads and team (across Google DeepMind, Research and Alphabet) on one of the most interesting and challenging projects of my career.
We have three versions of Gemini:
(a) Ultra (b) Pro and (c) Nano
We make significant…
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks,…
A new image generation model just dropped.
Great work by the team!
+ Auto-regressive, encoder->decoder Transformer
+ Classifier-free sampling (see the sketch below).
+ ViT-VQGAN
Really amazing results: Image from the website.
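On the classifier-free sampling point: here is a minimal sketch of how classifier-free guidance is typically applied to an autoregressive image-token model at sampling time. The function signature and guidance scale are illustrative assumptions, not this model's actual API.

```python
def guided_logits(model, tokens, text_cond, guidance_scale=2.0):
    """Classifier-free guidance for next-token logits (illustrative sketch).

    `model(tokens, cond)` is assumed to return logits over the image-token
    vocabulary. Because the text condition is randomly dropped during training,
    the same model can be queried with and without it at sampling time.
    """
    cond_logits = model(tokens, text_cond)  # conditioned on the text prompt
    uncond_logits = model(tokens, None)     # unconditional (prompt dropped)
    # Push the prediction toward the conditional distribution.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```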
Shampoo is out of the bottle!
Preprint: "Second order optimization made practical"
We train certain neural nets faster than before.
How fast? It has shown up to a ~40% reduction in training time for a Transformer.
(@tomerikoriko)
PaLM-2 is Generally available for developers!
“With this update, developers can access our text model powered by PaLM 2, Embeddings API for text, and other foundation models in Model Garden”
Today, we present our paper on the Google Search Ads CTR model at ORSUM @ACMRecSys, Seattle.
We highlight ML techniques suited to *online learning* that go well beyond traditional accuracy improvements.
A short thread:
1/n
Prompt: "A koala bear and grizzly bear playing chess. They are sitting at a table on the beach. You can see the waves crashing into the shores. Bears are very stressed. DSLR camera photo."
#imagen
#googleai
#brain
🐻🐨♟️🏖️
Batch Entropy Regularizer that makes untrainable networks train: remove the skip connections and normalization layers. Published at TMLR; works on PaLM-like Transformers -- thanks to Lucid for the pointer!
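I won't reproduce the paper's exact formulation from memory, but roughly, the regularizer keeps the entropy of each layer's activations across the batch from collapsing. A loose sketch under that assumption; the Gaussian entropy estimate, the target value, and the names are my placeholders, not the paper's:

```python
import torch

def batch_entropy(acts, eps=1e-6):
    # Gaussian estimate of the differential entropy of activations across the
    # batch, averaged over features. acts: [batch, features]. Sketch only.
    var = acts.var(dim=0) + eps
    return (0.5 * torch.log(2 * torch.pi * torch.e * var)).mean()

def batch_entropy_penalty(per_layer_acts, target=0.5):
    # Penalize layers whose batch entropy drifts from a target value, which is
    # what (in spirit) keeps deep nets without skips/normalization trainable.
    return sum((batch_entropy(a) - target) ** 2 for a in per_layer_acts)
```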
PaLM 2 is online: 🌴🌴
Paper:
I learned to code with instructions in Malayalam, so this capability of PaLM 2 instruction-tuned models to explain code makes me quite happy!
Possibilities are endless here!
🌴🌴
Very proud of this work; specifically, not compromising on model quality while being extremely fast for inference, so that we can serve the whole wide world, i.e. bringing technology to everyone!
@karpathy
@giffmana
The team is working hard to bring audio inputs to the AI Studio interface for Gemini 1.5 Pro. We have an internal version that handles audio and video and can sample the video less frequently to increase the length of content that can be handled.
@karpathy
, thanks for the…
L👈: "A Koala bear in a suit standing at a podium to teach. Variational bayesian methods is written on the chalkboard. There are lot of confused cats in the crowd"
R 👉:"Variational bayesian methods is all you need is written on the chalkboard."
🐨🙀
#imagen
#googleai
#brain
Prompt: "A train ride in the monsoon rain in Kerala. With a Koala bear wearing a hat looking out of the window. There is a lot of coconut trees out of the window"
#imagen
#googleai
#brain
(I will host the imagen team at my home in Kerala if they choose to visit 🚀)
GPT-4 can do well on MIT test
Community: oh the methodology is all wrong
🌶️ Introducing a new optimizer that is 2x faster than AdamW
Community: Impressive! Impressive methodology!
Said methodology: use half the steps for the new method and change the learning rate schedule to…
Code for Distributed Shampoo: a scalable second order optimization method
💥
Joint work w/ @GuptaVineetG
State of the art on MLPerf ResNet-50 training to reach 75.9% accuracy at 32,768 batch size
Trains in 1729 steps (not a typo), 284 secs on TPUs.
Code for SM3, a memory-efficient adaptive first-order optimizer, is now open-sourced under the
@GoogleAI
research repository. It's useful for training very large language models, e.g. BERT-Large, GPT-2, etc.
I completely missed the Parallel Layers used in PaLM. It makes training 15% faster at larger scale. Mainly: run the MLP and Attention together! Thanks
@achowdhery
for pointing this out to me! The savings in compute are quite substantial.
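For reference, a minimal sketch of the parallel formulation versus the usual serial block; the module names here are illustrative, not PaLM's actual code:

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    """PaLM-style parallel block: attention and MLP both read the same
    normalized input and their outputs are summed, instead of running
    attention first and the MLP on its output."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        h = self.norm(x)
        # Serial:   x = x + attn(norm1(x)); x = x + mlp(norm2(x))
        # Parallel: one norm, both branches on the same input -> fusible matmuls.
        return x + self.attn(h) + self.mlp(h)
```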
“For example, if the traditional algorithm taught in school multiplies a 4x5 by 5x5 matrix using 100 multiplications, and this number was reduced to 80 with human ingenuity, AlphaTensor has found algorithms that do the same operation using just 76 multiplications.”
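For intuition on where the 100 comes from: the schoolbook algorithm uses one scalar multiplication per (row, inner index, column) triple. A two-line check:

```python
def schoolbook_mult_count(m, k, n):
    # Scalar multiplications for an (m x k) @ (k x n) product:
    # one per output entry per inner index, i.e. m * k * n.
    return m * k * n

print(schoolbook_mult_count(4, 5, 5))  # 100; AlphaTensor's decomposition uses 76
```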
Today in
@Nature
:
#AlphaTensor
, an AI system for discovering novel, efficient, and exact algorithms for matrix multiplication - a building block of modern computations. AlphaTensor finds faster algorithms for many matrix sizes. 1/
Some excellent work by
@jeankaddour
and colleagues
“We find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate”
☠️
Tinker with this visualization here for training neural networks with noise added in the dataset. Made with tensorflow.js and inspired by neural network playground. 👇
10 years ago I left working on the iOS communicator at MSFT to work on machine learning at Google, without many connections or a doctoral degree for that matter.
Crazy how time flies! And due to a bunch of lucky breaks, very thankful to be doing ML things at Google 🧠
We're introducing an optimizer for deep learning, MADGRAD. This method matches or exceeds the performance of the Adam optimizer across a varied set of realistic large-scale deep learning training problems.
The next big jump in neural network performance is going to happen when the community embraces non-uniformity.
E.g., stacking identical layers has become ingrained in our tools and mindsets.
Gen AI on-device? A foundation model on the phone?
Imagine an entire operating-system-level unlock of capabilities:
Well, the Pixel 8 Pro will have it. Rick announced it here:
The model was trained with several algorithmic breakthroughs by our team to…
Gemini Nano improves on the efficiency frontier. The models are multimodal as well; see results in the paper.
Nano series: at 1.8B and 3.25B parameters, it packs in so much to provide high utility on device.
First foundation model on the device!
Gemini Nano is super efficient for tasks that are on-device. Android developers can sign up for an early access program for Gemini Nano via Android AICore and Pixel 8 Pro users can already see it rolling out in features like Summarize in Recorder and Smart Reply in Gboard + much…
Just tested it on a paragraph from one of my papers, and it does seem like it improves the writing.
Sure, generating whole papers with an LM is not cool, but improving the writing quality seems good for everyone?
People shocked that Stable Diffusion was trained with fewer resources haven't been paying attention to many things, including the Craiyon/DalleMega runs. Scale is not all you need, dear community. Nice to write that in a paper though.
First of all, massive congratulations are in order to
@zacharynado
@GeorgeEDahl
@naman33k
and co-authors on this massive work spanning multiple years on benchmarking neural network training algorithms! 🎉🍾
I have a horse 🐴 in the race and it's called Distributed Shampoo 🦄
Benchmarking Neural Network Training Algorithms
Presents AlgoPerf, a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware.
Augments the Transformer 🤖 architecture with n-grams constructed from a discrete latent representation of the text sequence.
Faster training and inference when it matters the most, as the core operations are (distributed) gather/scatter. 🎇
Code:
N-Grammer: Augmenting Transformers with latent n-grams
abs:
propose modification to the Transformer architecture by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence
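A rough sketch of the idea as I read it; the hashing scheme, table size, and the concatenation at the end are my placeholders rather than the paper's exact construction:

```python
import torch
import torch.nn as nn

class LatentNGram(nn.Module):
    """Sketch: form bigram ids from the discrete latent ids of the token
    sequence, embed them, and combine them with the token embeddings."""
    def __init__(self, ngram_vocab=2**18, ngram_dim=64):
        super().__init__()
        self.ngram_vocab = ngram_vocab
        self.ngram_emb = nn.Embedding(ngram_vocab, ngram_dim)

    def forward(self, token_emb, latent_ids):
        # latent_ids: [batch, seq], e.g. from product-quantizing the embeddings.
        prev = torch.roll(latent_ids, shifts=1, dims=1)
        bigram_ids = (latent_ids * 1_000_003 + prev) % self.ngram_vocab  # cheap hash
        ngram_emb = self.ngram_emb(bigram_ids)           # gather: cheap, shardable
        return torch.cat([token_emb, ngram_emb], dim=-1)
```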
Finishing up slides now.
I will be talking about “Scalable second order optimization for deep learning” at Deep Learning: Classics and Trends, tomorrow.
Called it with some knowledge about the model.
Ultra is going to break ground!
Those quibbling over HellaSwag and MMLU are just showing their misunderstanding of evaluation.
Onwards 🚀
🔥Breaking News from Arena
Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to
@Google
for the remarkable achievement!
The race is heating up like never before! Super excited to see what's next for Bard + Gemini…
1/3 I think we should coin a new term: social media AI researcher, where instead of publishing your work at a rigorous peer-review venue, you tweet about your findings and opinions.
There are huge advantages:
1. It is easy: you don't have to deal with that annoying reviewer 2.
The AGI I want is one that realizes I made a dumb mistake with the batch size which makes it OOM on a supercomputer, and tries a smaller one for me - while I am sleeping, so I don't have to babysit the models - increasing experimentation throughput!
"Ever want to learn how JAX works, but the implementation seemed impenetrable?
Well, you're in luck! By reading this tutorial, you'll learn every big idea in JAX's core system. You'll even get clued into our weird jargon!"
The JAX team keeps exceeding every expectation 😂
Gemini 1.5 Pro - A highly capable multimodal model with a 10M token context length
Today we are releasing the first demonstrations of the capabilities of the Gemini 1.5 series, with the Gemini 1.5 Pro model. One of the key differentiators of this model is its incredibly long…
AdaGrad and Shampoo with aggregated second moments now work for deep learning.
This is quite similar to the Grafting technique we introduced to disentangle the direction from the step size, where we also found identical results.
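For anyone unfamiliar with grafting: the update takes its direction from one optimizer and its per-step magnitude from another. A minimal layer-wise sketch; the two step arguments are simply whatever your Shampoo and first-order optimizers would have produced:

```python
import numpy as np

def grafted_update(shampoo_step, first_order_step, eps=1e-12):
    # Direction from the Shampoo step, magnitude from the (well-tuned)
    # first-order step: this disentangles direction from step size.
    direction = shampoo_step / (np.linalg.norm(shampoo_step) + eps)
    return np.linalg.norm(first_order_step) * direction
```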
More exciting news today -- Gemini 1.5 Pro result is out!
Gemini 1.5 Pro API-0409-preview now achieves
#2
on the leaderboard, surpassing
#3
GPT4-0125-preview to almost top-1!
Gemini shows even stronger performance on longer prompts, in which it ranks joint
#1
with the latest…
Distributed Shampoo,
ICML: 4 diverse workloads.
AC: do another 7 ablations costing a million dollars 💵 for us to believe you have beaten Adam.
ICLR: Beats every MLPerf workload on wall-clock time.
AC: I am a distributed systems expert and I don’t believe you (charitable…
Fun game. Clocking 17973 citations: "Distilling the knowledge in a neural network"
@geoffreyhinton
,
@OriolVinyalsML
,
@JeffDean
Reviewer 38 (NeurIPS 2014): "This work is incremental and unlikely to have much impact even though it may be technically correct and well executed."
Everyone is looking at metrics and demos out here and debating the nuances of evals, which is fine.
But they're missing the point: these models are in Bard and the Pixel 8 Pro right now and coming to more surfaces.
Everyone is sad that top AI conferences are virtual this year.
Folks, online conferences actually allow a lot more people to attend who otherwise would not be able to, due to weaker passports.
I have attended only one conference abroad, which was in Canada, and I don't want to go through that pain again.
My career so far was built on real-world, deployed ML. So take it with a grain of salt. 🧂
H-index and citation counts are weakly correlated with usefulness or reality. 🤷♂️
So read the papers to judge the work, not the citation counts.
My understanding is that Google Scholar is the pet project of one small team. It's crazy how little design choices (e.g., default sort papers by # of cites, total # of citations prominently displayed) influence all of academia by making citations a default "measuring stick".
Prompt: "A koala bear in a suit at a dining table reading a newspaper and drinking tea contemplating. Photo taken by a DSLR camera."
#imagen
#googleai
#brain
Inspiration from "I should buy a boat" cat meme.
Quite spectacular results. 🐨☕️📰
New phrase learned today from staying up on Twitter: “LLM doping”.
Who wants to make a doping test and an agency that checks LLMs for eval doping?
I would cut a check for starting something along these lines.
I tried this in Malayalam too last night and it blew my mind!
I am very excited by this; studying can be much more effective if there were a personalized tutor who could explain the “how” and “why” of each step along the way. I see this future as immensely positive!
The multimodal and reasoning capabilities of Gemini are quite strong. The benchmark results, which I’ll discuss in a moment are nice, but I’m most excited by demonstrations of what it can do.
Consider the image below. A teacher has drawn a physics problem of a skier going down…
From ICLR, Jorge provides a fast approximation for the inverse 4th roots in Shampoo!
I recommend implementing the stable & fast coupled Newton inverse, but maybe for some problems computing the approximate inverse pth root more often could be useful.
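For the curious, here is a dense, unoptimized sketch of the coupled Newton iteration for the inverse pth root of a symmetric PSD statistics matrix; damping schedules, warm starts, and mixed-precision details from the actual Shampoo implementation are omitted:

```python
import numpy as np

def inverse_pth_root(A, p=4, ridge=1e-6, iters=100, tol=1e-6):
    """Compute A^{-1/p} for symmetric PSD A via a coupled Newton iteration."""
    n = A.shape[0]
    I = np.eye(n)
    A = A + ridge * I                      # damping for numerical stability
    c = np.linalg.norm(A, 2)               # scale so the spectral radius is <= 1
    M, X = A / c, I.copy()
    for _ in range(iters):
        T = ((p + 1) * I - M) / p
        X = X @ T
        M = np.linalg.matrix_power(T, p) @ M
        if np.linalg.norm(M - I) < tol:    # M -> I as X -> (A/c)^{-1/p}
            break
    return X / c ** (1.0 / p)              # undo the scaling: c^{-1/p} (A/c)^{-1/p}
```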
@Sci_j_my
They were several papers who couldn't cite your papers because of TOTAL failure on bibtex, they didn't get to compile twice. BibTex is now switching parts of it code.. have you heard of that? my people tell me that. We had tremendous citation, with certified by reviewers.
Really enjoyed reading Nichol & Dhariwal
This plot was the most interesting.
Optimizing the VLB (variational lower bound) was harder than a simple mean squared error + lambda * L_vlb.
Then they wave a magic wand to fix this: the green curve. 1/2
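To make the notation concrete, the hybrid objective is just the simple epsilon-prediction MSE plus a small weight on the VLB term; the 0.001 default is from my reading of the paper, so treat it as an assumption:

```python
def hybrid_loss(l_simple, l_vlb, lam=0.001):
    # Improved-DDPM-style objective: the MSE term dominates, and the small
    # VLB term trains the learned variances without destabilizing optimization.
    return l_simple + lam * l_vlb
```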
*cracks knuckles*
and thus, we begin the "🌴PaLM v2" drinking game (but with coffee, tea, or your favorite caffeinated beverage of choice, as it's early! 😉)
#GoogleIO2023
#GoogleIO
Has anyone done large-scale profiling of inference speeds for different LLMs of comparable accuracy from different providers?
Gemini Pro seems dramatically faster, from my personal experience, than, say, GPT-3.5. Seeing some numbers w/ error bars on this would be nice.
Prompt: "Photorealistic koala bear wearing a tie dye tshirt. The koala bear is wearing a sunhat and aviator glasses. koala bear is inside a houseboat in Kerala. There is a lot of coconut trees in the background."
#imagen
#googleai
#brain
🚀
I have the same t-shirt!
👕🐨🥥🌴
The frustrating part of deep learning is that almost anything works, so for those wanting to know why something works, it's just an endless pit of misery and unanswered questions.
A bottleneck layer is a layer that has fewer neurons than the layers above/below. ⌛️
Then, what's a layer that has more neurons than the layers below and above called?
A 10x layer? A booster layer? Need help.
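To make the shapes concrete, a tiny sketch of both cases; the widths are just examples:

```python
import torch.nn as nn

dim = 512

# Bottleneck: the hidden layer is narrower than the layers above/below. ⌛️
bottleneck = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim))

# The unnamed case: the hidden layer is wider than the layers above/below
# (the standard 4x-wide Transformer MLP already has this shape).
wide_hidden = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
```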
Incredible! Found linear mode connectivity on test loss (right), not on train!
My mind is blown -- this is huge!?
Updates (@stanislavfort's colab)
+ With DistributedShampoo (~0 train loss 🚀)
train/test loss = 0.0002/0.314 vs (0.350/0.333)
train/test accuracy = 0.999/0.982
I reran Stan's colab (thank you for the colab!) with DistributedShampoo instead of Adam or SGD and got this.
Everything looks connected. Hmm? Is this a bug???
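For anyone who wants to poke at this themselves, the check is just evaluating the loss along the straight line between two trained solutions in weight space. A minimal sketch; `eval_loss` and the two parameter dicts are placeholders for whatever the colab produces:

```python
import numpy as np

def interpolation_curve(params_a, params_b, eval_loss, num_points=21):
    """Linear mode connectivity check: loss along the segment between two
    trained weight vectors; no barrier along the path => 'connected'."""
    losses = []
    for alpha in np.linspace(0.0, 1.0, num_points):
        interp = {k: (1 - alpha) * params_a[k] + alpha * params_b[k] for k in params_a}
        losses.append(eval_loss(interp))
    return losses
```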
Want to know more? Well, you have to wait for a bit.
Boris Dayma’s guide to training large models: a must-read.
He is using a second-order method (Distributed Shampoo) for all his training, making him one of a handful of humans on earth who know how to deploy it correctly.
The results speak for themselves:
Check it out!
📉 "A Recipe for Training Large Models"
👉 Report:
I've been working for a while on this guide, sharing practical recommendations with my simple recipe for training models 🧑🍳
Together with @quocleix, we used Gemini Advanced yesterday to brainstorm for an internal research week debate. It was quite an incredible experience and an effective companion in creative brainstorming; nothing else compares.
A small thread on related work on more-than-diagonal optimization, in the context of neural networks.
Kronecker factorization: reducing the cost from (mn)^2 to m^2 + n^2 comes from this sparsely cited paper by Heskes, 2000.
See the MLP section.
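To unpack the (mn)^2 vs m^2 + n^2 claim: a full-matrix preconditioner for an m x n weight's gradient is (mn) x (mn), while the Kronecker-factored version keeps one m x m and one n x n statistic and applies them from the left and right. A rough Shampoo-style sketch; the damping value and the eigendecomposition-based root are illustrative choices:

```python
import numpy as np

m, n = 1024, 4096
print((m * n) ** 2)   # full preconditioner entries: ~1.8e13, infeasible to store
print(m**2 + n**2)    # Kronecker factors: ~1.8e7 entries, easily stored

def kron_preconditioned_step(G, L, R):
    """Precondition gradient G (m x n) with Kronecker factors L (m x m, from
    accumulated G @ G.T) and R (n x n, from G.T @ G): L^{-1/4} @ G @ R^{-1/4}."""
    def inv_fourth_root(A):
        w, V = np.linalg.eigh(A + 1e-6 * np.eye(A.shape[0]))
        return V @ np.diag(np.maximum(w, 1e-12) ** -0.25) @ V.T
    return inv_fourth_root(L) @ G @ inv_fourth_root(R)
```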
PSA: Switch your optimizer to Shampoo!
We recently compared Shampoo to a tuned ensemble of Adam and SM3 at
@HomebrewNLP
and found that the hyperparameter search space contains many more "winning tickets," which also achieve lower losses!