Announcing MatFormer - a nested🪆(Matryoshka) Transformer that offers elasticity across deployment constraints.
MatFormer is an architecture that lets us use 100s of accurate smaller models that we never explicitly trained!
1/9
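Roughly the nesting idea, as a toy sketch (illustrative only, not the paper's code; all sizes and variable names here are made up): a single FFN whose hidden units are ordered so that any prefix of them is itself a smaller, usable FFN.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W_in = 0.1 * rng.normal(size=(d_model, d_ff))
W_out = 0.1 * rng.normal(size=(d_ff, d_model))

def nested_ffn(x, frac):
    """Run the FFN using only the first `frac` fraction of its hidden units."""
    k = int(d_ff * frac)
    h = np.maximum(x @ W_in[:, :k], 0.0)   # ReLU over a prefix of hidden units
    return h @ W_out[:k, :]

x = rng.normal(size=(4, d_model))
for frac in (0.25, 0.5, 1.0):              # smaller submodels via slicing, no retraining
    print(frac, nested_ffn(x, frac).shape)
```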
New EMNLP paper “Investigating Multilingual NMT Representation at Scale” w/ @ankurbpn, @orf_bnw, @caswell_isaac, @naveenariva. We study transfer in massively multilingual NMT @GoogleAI from the perspective of representational similarity.
Paper: 1/n
Tired: Catching imposter syndrome by reading PhD applications from students way smarter than you.
Wired: Getting excited about talking them into building cool things with you ✨
We wrote a blog post about our work on Task-level Mixture-of-Experts (TaskMoE), and why it's a great way to efficiently serve large models (vs. more common approaches like training a large model and then compressing it via distillation).
Read all about Task-level Mixture-of-Experts (TaskMoE), a promising step towards efficiently training and deploying large models, with no loss in quality and with significantly reduced inference latency ↓
Late tweet, but thank you ENSLP #NeurIPS2023 for the best paper award, and @Devvrit_Khatri for the excellent presentation on behalf of the team @adityakusupati!
Excited to push further on conditional computation for tiny fast flexible models 🚀
@ankurbpn @orf_bnw @naveenariva @GoogleAI Huge thanks to my collaborators at @GoogleAI, without whom this work would not have been possible. This work was done as a part of the Google AI Residency - applications open soon, so definitely check it out! 8/8
I'm at #NeurIPS2023 today presenting MADLAD-400 with @BZhangGo and @adityakusupati at 5:15pm in Hall B1/B2 #314! Come by and chat w/ us about creating *massive* datasets, making sure they're not garbage, and multilingual LMs :D
📢🪆MatViT-B/16 & L/16 model checkpoints & code are public - drop-in replacements that enable elastic compute for free!🔥
Try them out; let us know😉
Shout out to @kfrancischen for the release; @anuragarnab & @m__dehghani for the amazing Scenic library.
New research demonstrates how a model for multilingual #MachineTranslation of 100+ languages trained with a single massive #NeuralNetwork significantly improves performance on both low- and high-resource language translation. Read all about it at:
We just released the MADLAD-400 dataset on @huggingface! Big (7.2T tokens), remarkably multilingual (419 languages), and cleaner than mC4, check it out:
Kudos to @996roma for doing the analysis of linguistic phenomena in RxR, and many thanks to @snehaark for working with Roma for Telugu! Also, to @yoavartzi and team for establishing this overall approach with Touchdown.
Reasons to hire Aditya:
1) v cool representation learning research with real world impact
2) genuinely cares about and bats for his mentees and collaborators
3) vibes are immaculate ✨
📢📢At the last minute, I decided to go on the job market this year!!!
Grateful for RTs & promotion at your univ.😇
CV & Statements:
Will be at #NeurIPS2023 presenting AdANNS, Priming, Objaverse & MADLAD! DM if you are around, would love to catch up👋
How many languages can we support with Machine Translation? We train a translation model on 1000+ languages, using it to launch 24 new languages on Google Translate without any parallel data for these languages. Technical 🧵below: 1/18
However, crawled datasets are often noisy, and this is even worse for under-resourced languages, with many datasets containing data that is not even in the labeled language (). So, we self-audited our initial dataset and kept only 419 of the 498 languages. 3/n
Data cleaning, documentation and auditing practices beyond English still have a long way to go, and we hope that this work furthers progress in this area! 6/n
I used to use a small notebook to set my agenda, but I started a version of this after reading @deviparikh's post .
The most immediate benefit was the reduced mental overload of remembering to talk to people/return stuff.
What do we need to scale NLP research to 1000 languages? We started off with a goal to build a monolingual corpus in 1000 languages by mining data from the web. Here’s our work documenting our struggles with Language Identification (LangID):
1/8
Manual smell tests of your data are limited, but super useful! I would love for all new large scale datasets to define their own audits AND release the results in all its messy glory.
Great work by Sneha to create a new, open, and highly multilingual web dataset...with a great acronym! It also sets a nice precedent that every single one of the 419 languages in the crawl was looked at and considered for specific filtering.
An early draft of the machine learning interviews book is out 🥳
The book is open-sourced and free. Job search is a stressful process, and I hope that this effort can help in some way.
Contributions and feedback are appreciated!
Measuring the social impacts of foundation models for as many languages as we support is super important - @Chris_Choquette and @katherine1ee led some intriguing work investigating the memorization properties of multilingual models.
419 languages is so many languages (!!)
Side note:
We investigated how having lots of different languages in one model impacts what and how much is memorized. Which examples get memorized depends on what other examples are in the training data!
@ankurbpn @orf_bnw @naveenariva @GoogleAI We also find that representations of high-resource and/or linguistically similar languages are more robust when fine-tuning on an arbitrary language pair, which is critical to determining how much cross-lingual transfer can be expected in a zero- or few-shot setting. 6/n
Massively Multilingual NMT in the wild: 100+ languages, 1B+ parameters, trained using 25B+ examples. Check out our new paper for an in-depth analysis:
#GoogleAI
Is it just me or is the worst part about transitioning from school to industry getting around the fact that you're productive only at 10am and 10pm?
@TaliaRinger No, go for it! For North Indian weddings you should be fine with either; for a South Indian wedding I'd err on the side of wearing a sari (ask the host!). If you want to wear a sari, book an appointment with a local salon to get someone to tie it for you - it's less stressful.
Me going over my Google Keep notes near the end of the week:
Note: Talk to xyz
Me: Who is xyz?
Note: Weekend
Me: Yes, it exists.
Note: what is useful?
Me: Not you, clearly. 🙄
OK y'all, it is now my pleasure to show off some of the truly, genuinely heinous plots students in my Reproducible Data Analysis class made.
Content warning: these plots are f***ing awful.
My parents started out teaching me both Telugu and English, but prioritized the latter for far too long for similar reasons. I took Telugu class for ~8 years when we moved back to India, but it simply isn't the same.
My older sister's kindergarten teacher told my parents to stop speaking Chinese at home or she would struggle, so my dad spoke only English to us. My mom spoke a mix, and my grandparents spoke only Chinese, so we still learned some Chinese, but the message was assimilate or fail
XLNet: a new pretraining method for NLP that significantly improves upon BERT on 20 tasks (e.g., SQuAD, GLUE, RACE)
arxiv:
github (code + pretrained models):
with Zhilin Yang, @ZihangDai, Yiming Yang, Jaime Carbonell, @rsalakhu
Easily deploying large models is an important direction of research, and we believe TaskMoE is a promising step towards more inference friendly algorithms that retain the quality gains of scaling. 9/9
The focus on SOTA has caused a dramatic increase in the cost of AI, leading to environmental tolls and inclusiveness issues. We advocate research on efficiency in addition to accuracy (#greenai). Work w/ @JesseDodge @nlpnoah and @etzioni at @allen_ai
- Write a novel or collection of short stories with women who both work on sciencey things and have an ok personal life.
- Perform Standup.
- Write/direct a film with women who both work on sciencey things and have an ok personal life.
- Paint enough for an art exhibition.
I want to:
- Publish a fantasy novel
- Perform standup comedy
- Produce an animated film
What do you want to do that's outside of your traditional "career" trajectory?
The Guardian is updating our style guide to accurately reflect the nature of the environmental crisis.
“Climate change” —> “climate emergency, crisis or breakdown”
“Global warming” —> “global heating”
“Climate skeptic” —> “climate science denier”
@chipro Speculation, but I'm certain more than 8% of women viewing are interested: women of twitter, never hesitate to reach out! I caught myself doing this recently - when @chipro organized Brunchpropagation, I deffo held back for a bit thinking "Why would anyone want to talk to me?"
Finally, when scaling up to 200 language pairs, our 128-expert task-MoE (13B parameters) still performs competitively with a token-level counterpart, while improving peak inference throughput by 2.6x. 8/n
The Pokémon paper is out! In this study, we scanned the brains of adults who, as children, became Pokémon experts. We find a region of their brain becomes uniquely responsive to @Pokemon, helping us get at why the brain is organized the way it is.
Took a while (don't ask) but here they are: Notes from "Science of Deep Learning" class co-taught with @KonstDaskalakis now available: . More coming soon (promise!). Feedback very welcome! Thanks to @andrew_ilyas for the heroic effort on the final revisions.
Another significant advantage of TaskMoE is that we retain all the gains from scaling - our method is +2.1 BLEU on average across all languages vs distilling the TokenMoE to a student model with a size comparable to the subnetwork extracted from TaskMoE. 7/n
1/ Can we use model-based planning in behavior space rather than action space? DADS can discover skills without any rewards, which can later be composed zero-shot via planning in the behavior space for new tasks.
Paper:
Website:
So we route tokens according to broader categories (routing by task boundaries vs routing per token) - that is, every token of a language is routed to the same subnetwork.
This enables the model to dedicate fewer experts to a single task identity during training and inference. 4/n
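A toy sketch of the difference (illustrative only, not the paper's implementation; the gate, expert count, and pooling choice here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model = 4, 8
tokens = rng.normal(size=(6, d_model))           # embeddings for one sentence
gate = rng.normal(size=(d_model, num_experts))   # routing weights (random here)

# Token-level routing: each token independently picks its top-1 expert, so one
# sentence may touch several experts and all of them must be served.
token_choice = np.argmax(tokens @ gate, axis=-1)

# Task-level routing: pool the gate input over the task (e.g. the language pair),
# so every token of that task goes to the same expert and only that expert
# needs to be loaded at inference time.
task_choice = int(np.argmax(tokens.mean(axis=0) @ gate))

print("token-level experts used:", sorted(set(token_choice.tolist())))
print("task-level expert used:  ", task_choice)
```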
I've made this cheat sheet and I think it's important. Most stats 101 tests are simple linear models - including "non-parametric" tests. It's so simple we should only teach regression. Avoid confusing students with a zoo of named tests. 1/n
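For instance (a minimal sketch, assuming scipy and statsmodels; the data below are simulated, not from the cheat sheet): a two-sample t-test gives the same p-value as regressing the outcome on a 0/1 group dummy.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(42)
group = np.repeat([0, 1], 50)                    # 0/1 dummy for the two groups
y = 2.0 + 0.5 * group + rng.normal(size=100)     # outcome with a true group effect

# Classic named test (equal-variance two-sample t-test)
t, p_ttest = stats.ttest_ind(y[group == 1], y[group == 0])

# The same thing as a linear model: y ~ 1 + group
ols = sm.OLS(y, sm.add_constant(group)).fit()

print(p_ttest, ols.pvalues[1])                   # identical p-values
```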
Mixture-of-Experts (or MoE) models are a great way to scale! Researchers have successfully scaled multilingual neural machine translation (MNMT) models up to 1 Trillion parameters and beyond. 2/n