📢Life update:📢
I moved to Toronto, where I'm now an associate professor at the University of Toronto and an associate research director at the Vector Institute.
I wrote a blog post about the long winding path that led me here:
Announcing a new research focus in my lab: Developing tools to enable collaboratively-built and continually-improved models.
Blog post:
Paper on model "patches":
Paper on "merging" models:
Thread ⬇️ (1/11)
The year is 2012. I am learning deep learning. We pre-train models as denoising autoencoders to provide a better initialization.
The year is 2022. I am teaching deep learning. We pre-train models as denoising autoencoders to provide a better initialization.
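The recipe being poked fun at here has barely changed in a decade: corrupt the input, train the network to reconstruct the clean version, then reuse the learned weights as an initialization. A toy numpy sketch of that idea (a linear denoising autoencoder with made-up sizes, not any specific published setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 clean examples with 20 features each.
X = rng.normal(size=(200, 20))

d_in, d_hidden, lr = 20, 8, 0.01
W_enc = rng.normal(scale=0.1, size=(d_in, d_hidden))
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_in))

losses = []
for step in range(300):
    # Denoising objective: corrupt the input, reconstruct the original.
    X_noisy = X + rng.normal(scale=0.5, size=X.shape)
    H = X_noisy @ W_enc              # encode the corrupted input
    X_hat = H @ W_dec                # decode back to input space
    losses.append(float(((X_hat - X) ** 2).mean()))

    # Manual gradient step on the reconstruction error (batch-averaged).
    G = 2 * (X_hat - X) / len(X)
    grad_dec = H.T @ G
    grad_enc = X_noisy.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# After pre-training, W_enc would initialize the first layer
# of a supervised network -- in 2012 and in 2022 alike.
```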
A student recently asked me if they should use BERT, GPT-n, or T5 for a simple NLP problem; I recommended a bag-of-words model. Where do I sign up for my curmudgeon license?
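For the record, the kind of baseline I mean fits in a few lines of standard-library code: bag-of-words counts feeding a nearest-centroid classifier (a toy sketch with invented data; any linear model over word counts works similarly):

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words: map a string to word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(examples):
    """Sum each class's bag-of-words vectors into a centroid."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(bow(text))
    return centroids

def predict(centroids, text):
    v = bow(text)
    return max(centroids, key=lambda lbl: cosine(v, centroids[lbl]))

train_data = [
    ("great movie loved it", "pos"),
    ("terrible plot hated it", "neg"),
    ("wonderful acting great fun", "pos"),
    ("boring and terrible", "neg"),
]
model = train(train_data)
print(predict(model, "loved the acting, great fun"))  # prints "pos"
```

No GPU, no pre-training, and for many simple problems it's embarrassingly hard to beat.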
New paper! We perform a systematic study of transfer learning for NLP using a unified text-to-text model, then push the limits to achieve SoTA on GLUE, SuperGLUE, CNN/DM, and SQuAD.
Paper:
Code/models/data/etc:
Summary ⬇️ (1/14)
New preprint!
We demonstrate an attack that can extract non-trivial chunks of training data from GPT-2.
Should we be worried about this? Probably!
Paper:
Blog post:
Today is my first day as a faculty researcher at
@huggingface
! I am extremely excited to join this incredible community. Expect awesome things soon! 🤗🚀
FixMatch was accepted at NeurIPS with 7/7/7/7 scores... after being rejected from CVPR and ICML for being "too simple". If you're dealing with a bogus rejection and know your work is good - don't quit, resubmit! Or just post to arxiv and skip the conference review roulette...
I often get emails from enthusiastic new researchers from outside the US. They take free ML courses and develop OSS, but can't afford MS programs, can't get into PhDs w/o publications, have trouble publishing w/o mentorship, and can't get visas for an RAship. Any advice for them?
I'm starting a professorship in the CS department at UNC in fall 2020 (!!) and am hiring students! If you're interested in doing a PhD
@unccs
please get in touch. More info here:
I recently have had a number of aspiring ML researchers ask me how to stay on top of the paper onslaught. Here are three concrete tips:
1) Pick a tiny subfield to focus on
2) Skim
3) Rely on your community
Thread to explain ⬇️ (1/5)
📢 I am hiring PhD students for Fall 2021! 📢
If you want to work with us on semi-supervised/unsupervised/transfer learning and beyond, you should apply:
Also, GRE is optional and we offer need-based admissions fee waivers! Contact me for more info.
This semester I'm teaching a role-playing paper-reading seminar on Large Language Models, covering 57 (!) papers on the good, bad, and ugly of LLMs. Follow along here:
Stages of implementing a machine learning algorithm:
1) Syntax errors
2) Dimension mismatch errors
3) NaNs
4) Model trains, but results are bad
5) Hyperparameter tweaking
...
N) Success!
If you are reeling from a NeurIPS rejection or stressing about an ICLR submission, remember that some of the best papers were never published anywhere except arxiv. Thread of a few favorites (1/5):
Can your NLP model handle noooisy mEsSy
#realworldtext
?
ByT5 works on raw UTF-8 bytes (no tokenization!), beats SoTA models on many popular tasks, and is more robust to noise.
📜 Preprint:
💾 Code/Models:
Summary thread ⬇️ (1/9)
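The "no tokenization" part is literal: the model's inputs are just the UTF-8 bytes of the string. A minimal sketch of the input mapping (the offset of 3 reflects my understanding that the released vocab reserves the first few IDs for special tokens like padding/EOS; treat it as an assumption, not the exact released code):

```python
def text_to_ids(text, offset=3):
    """Map a string to byte-level token IDs.

    UTF-8 bytes are 0..255; IDs below `offset` are assumed to be
    reserved for special tokens (e.g. pad/EOS).
    """
    return [b + offset for b in text.encode("utf-8")]

def ids_to_text(ids, offset=3):
    """Inverse mapping: drop special IDs, decode the rest as UTF-8."""
    return bytes(i - offset for i in ids if i >= offset).decode("utf-8")

ids = text_to_ids("noooisy mEsSy ¿text?")
print(len(ids))          # one ID per *byte*, not per word or subword
print(ids_to_text(ids))  # round-trips exactly, typos and all
```

Because every string maps losslessly to bytes, there is no out-of-vocabulary problem, which is a big part of the robustness to noisy text.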
The T5 paper has been published in JMLR! 🎉
Since I have already talked more than enough about T5, here instead is a thread about the (awesome) process of publishing in JMLR:
(1/10)
New blog post: "GANs and Divergence Minimization", which covers the perspective of GANs as minimizing an "adversarial divergence" and draws parallels to maximum likelihood training. Also provides some motivation for better evaluation of GANs.
As of today, I've been an assistant professor for two years. It's been both awesome and difficult. I wrote a blog post about some of the things I've struggled with and how I've coped with them.
Hot take: Mathiness [1] is like an adversarial patch [2] for ML conference reviewers: Mathiness causes a reviewer to classify the paper as "accept" regardless of whether the math is useful/valid and the paper is any good. [3] Fig. 6 has some empirical evidence of this. (refs ⬇️)
New preprint! We introduce 𝚃-𝙵𝚎𝚠 and (𝙸𝙰)³, a few-shot learning recipe that outperforms in-context learning at dramatically lower costs and gets super-human results on the RAFT benchmark for the first time.
📄
💾
🧵⬇️
(1/9)
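(𝙸𝙰)³ is parameter-efficient in a very literal sense: the base model stays frozen, and the only trainable parameters are a few rescaling vectors that elementwise-multiply the attention keys, values, and intermediate FFN activations. A minimal numpy sketch of that mechanism (shapes and names are illustrative stand-ins, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq = 16, 64, 10

# Frozen pretrained weights (random stand-ins for illustration).
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))
W_ff = rng.normal(size=(d_model, d_ff))

# The only trainable (IA)^3 parameters: three rescaling vectors,
# initialized to ones so the adapted model starts out identical
# to the frozen base model.
l_k = np.ones(d_model)
l_v = np.ones(d_model)
l_ff = np.ones(d_ff)

def adapted_layer(x):
    k = (x @ W_k) * l_k                 # rescaled keys
    v = (x @ W_v) * l_v                 # rescaled values
    h = np.maximum(x @ W_ff, 0) * l_ff  # rescaled FFN activations
    return k, v, h

x = rng.normal(size=(seq, d_model))
k, v, h = adapted_layer(x)

# Trainable parameter count is tiny compared to the frozen weights:
n_trainable = l_k.size + l_v.size + l_ff.size  # 96 in this toy layer
n_frozen = W_k.size + W_v.size + W_ff.size     # 1536 in this toy layer
```

Fine-tuning updates only the `l_*` vectors, which is why the recipe is so much cheaper than full fine-tuning or long in-context prompts.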
New blog post where I argue that "large language model development" can be considered a new subfield that grew out of deep learning, NLP, etc. and reflect on what to do when your field of study gives birth to a new one:
How can we recycle specialized PEFT modules to create a generalist MoE-style model? We introduce PHATGOOSE, which learns a post-hoc routing scheme and significantly improves zero-shot generalization.
📜
📝
💾
Also, I am 1000% hiring PhD students this round! If you want to work on
- open models
- collaborative/decentralized training
- building models like OSS
- coordinating model ecosystems
- mitigating risks
you should definitely apply! Deadline is Friday 😬
The t5 library now has a simple API that connects the text-to-text data loading/processing/evaluation pipeline to
@huggingface
Transformers' PyTorch implementation of the T5 models! Here's a usage example:
New blog post about a course format
@_AlecJacobson
and I have been using: the role-playing seminar. It's an alternative to the standard one-presenter-per-class graduate-level paper-reading seminar and is dramatically more interactive, informative, and fun.
Many people are familiar with code smell () but researchers should also have a good sense of "paper smell". Here are some examples for ML papers (thread):
New paper w/
@D_Berthelot_ML
Aurko Roy and
@goodfellow_ian
where we propose an adversarial regularizer for improving interpolation in autoencoders and measure whether it also improves representation learning performance. Paper , code
What's it take for an LLM to learn a fact? And can an LLM tell what's factual and not?
Check out our 💥two💥 new papers!
LLMs Struggle to Learn Long-Tail Knowledge
Evaluating the Factual Consistency of LLMs Through Summarization
I just made this figure for a class I am teaching on "learning from limited labeled data". The left plot represents 6 years of results; the right plot is ~1 year. Anyone else feel like our field is moving kinda fast?
New blog post: "You Don't Know JAX", a brief tutorial which covers the basics of computing gradients, just-in-time compilation, and auto-batching with JAX.
The slides from my talk "A Few Unusual Autoencoders" which I gave last month at
@VectorInst
and
@nyuMARL
are now online: The talk covers MusicVAE, ACAI, and some unpublished "adversarial denoising autoencoder" work.
Super happy that our code for "Realistic Evaluation of Semi-Supervised Learning Algorithms" () is finally released on GitHub: Send us pull requests! Joint work with
@avitaloliver
@gstsdn
@ekindogus
@goodfellow_ian
.
Now that "Do Transformer Modifications Transfer Across Implementations and Applications?" has been accepted to
#EMNLP2021
, we can finally tweet about it!
Paper 📝:
Code 💾:
Thread summary: ⬇️ (1/8)
When and why is it possible to extract training data from large language models?
In a new preprint, we show that the number of times a sequence is duplicated in the training data heavily impacts whether it can be successfully extracted.
Thread⬇️ (1/8)
After ~1 year, my article on building ML models like OSS has been published in Communications of the ACM!
Lots of exciting work in this direction since then and lots to come. If you are interested, join our community:
I contributed to the "Learning with Fewer Labeled Examples" chapter of this incredible book. The chapter is a very broad and up-to-date overview of semi-supervised/transfer/meta/few-shot learning, domain adaptation, data augmentation, and beyond.
I am pleased to announce that the camera ready version of my new textbook, "Probabilistic Machine Learning: An Introduction", is finally available from . Hardcopies will be available from MIT Press in Feb 2022.
Last year,
@yisongyue
told me that he has his students meet without him to brainstorm honest collective feedback. I had my advisees do this and it was super helpful, so I wrote a blog post about it:
There is a strange situation in our field: Most people I know and respect (and most people on Twitter in general) agree that "simple is better than complex". But the consensus of the cabal of anonymous, faceless reviewers seems to be the opposite. What is going on?
Reviewers automatically assume that simple is not novel. This is sheer laziness. Yes, it may be simple and obvious in retrospect, but someone had to have that insight first. Simple is good. Simple is robust, easy to implement and reproduce, broadly applicable, etc.
Now that "How Much Knowledge Can You Pack Into the Parameters of a Language Model?" has been published at
#EMNLP2020
(poster at Gather Session 3G, 11/17 UTC-05:00), I can tell you the funny and awful story of how this paper came to be. (1/19)
I gave this talk again today to an audience of CS majors who didn't have any ML experience. It's really rewarding to be forced to explain things like variational inference and autoregressive models without using _any_ technical language.
New pre-print! Monotonic Infinite Lookback Attention (MILk): an online attention mechanism which we applied to simultaneous machine translation. It allows the model to attend to the entire input sequence up to a location set by a monotonic attention head.
Hot take: When evaluating a self-supervised model's performance on a new task without fine-tuning, don't call it "zero-shot"; call it "weakly supervised multi-task". These models only succeed when their unsupervised pre-training actually provides weak supervision for the task.
New paper with Chung-Cheng Chiu: Monotonic Chunkwise Attention (MoChA), an online/linear-time attention mechanism which computes soft attention over small chunks with adaptively set boundaries. Matches the performance of (offline) softmax attention on WSJ!
📣 Announcing the ICLR 2021 Workshop on Enormous Language Models 📣
We have an incredible speaker lineup that covers building, evaluating, critiquing, and improving large LMs, as well as a collaborative participant-driven benchmark and 2 panels!
More info:
In case you missed our
#neurips
poster on MixMatch () today because you aren't in Vancouver or didn't survive the poster session stampede, here's the PDF: and here's a transcript of what I said to everyone who came by: ⬇️ 1/11
New preprint! We introduce a simplified version of pattern-exploiting training called ADAPET. ADAPET outperforms PET and iPET on SuperGLUE without using task-specific unlabeled data or ensembling and beats few-shot GPT-3 with a much smaller model.
I am reading "A Neural Probabilistic Language Model" in detail for the first time and wow is it a fun read - discusses and justifies word embeddings, advocates scaling up models and data, uses rudimentary data- and model-parallel training... all done from scratch on CPUs.
Single-blind: Reviewers know author's identities
Double-blind: Reviewers don't know author's identities
Triple-blind: Reviewers must write reviews without reading their assigned submissions
Quadruple-blind: Authors are never told if their paper was accepted or rejected
...
I recently came across , which "assumes 2-3 runs" of T5-11B. In fact, we trained T5-11B *once*. That's why we spent 35 pages figuring out how we should train before we started training. You don't want to mess up a training run that big.
We showed last year (with OpenAI co-authors!) that it's surprisingly easy to extract verbatim training data from large LMs:
It kind of boggles my mind that they included GPL'd source code in the training set for this model.
Protip: if a random person asks you what you do and you want to avoid talking about the singularity, Sophia the robot, or "Facebook had to shut down AI when it invented its new language", just say "statistics".
I somehow missed this great paper by
@tuvuumass
et al.: They learn "task embeddings" (a la task2vec) for NLP tasks and show how they can be used to predict the effectiveness of intermediate-task transfer. Lots of experiments and a promising direction!
Mind-boggling results on the final EfficientQA leaderboard: The best system beat the REALM baseline by almost 20 points, and a 30 megabyte model got > 25% accuracy! Looking forward to hearing more about these systems at NeurIPS.
The mT5 paper was accepted to NAACL 🎉 so now we can stop pretending that it doesn't exist! Updated arxiv with many juicy new results, including a simple way to prevent "accidental translation" exhibited by generative models in zero-shot settings.
We are releasing mT5: A massively-multilingual version of T5 that supports over 💯 languages! mT5 was pre-trained on a multilingual version of C4 and achieves SoTA on many cross-lingual NLP tasks.
📜Pre-print:
💾Code/models:
As a contributor to this book, I've been offered a free copy. However, I don't know what I'd do with an actual physical book in 2022. If you'd like my copy, please reply with a < 280 character description of the benefit you'd get from receiving a copy and I'll pick a recipient.
I am delighted to announce that my new book, “Probabilistic Machine Learning: An Introduction”, is finally available in print format! You can order it from , or from Amazon. Also available at 1/4
In 15 minutes I'll be giving a talk on "The Benefits of Unified Frameworks for Language Understanding" at the "Conceptual Understanding of Deep Learning" workshop (). Livestream here:
I think we need a taxonomy of adjectives for describing neural network size.
"Large neural networks"
"Outrageously large neural networks" ()
"Ridiculously large neural networks"
"Inconceivably large neural networks"
"Uncomfortably large neural networks" ...
I saw this paper when it was presented at NeurIPS 2018 and really enjoyed it. It's worth a read for anyone who works on or thinks about generative models.
If all training images for a GAN/VAE/PixelCNN have 2 objects, will they only generate images with 2 objects? If trained on (🔵,💙,🔴), will they also generate ❤️? Find out in
@shengjia_zhao
's blog post on generalization and bias for generative models.
👉
Hot take: The most surprising thing about BERT isn't how well it worked when it was proposed, but how much better it would have worked if they had just pre-trained for longer on a more diverse dataset.
NeurIPS95 "Learning to Learn"
workshop focused on "unsupervised learning on a large corpus of unlabelled data to learn features for subsequent supervised learning on a smaller labelled corpus" and "using models previously learned for other problems when learning new problems" 🤔
Should we agree as a field not to post ICLR submissions on arxiv until after the review period is over? The paper is already public thanks to OpenReview, so it can (and should) be cited as existing work. arxiv'ing only serves to deanonymize it, which is probably a net negative.
A video of my talk "Doing Strange Things with Attention" which I gave at AI
@WithTheBest
in October is now online: Covers feedforward attention, sequence embedding using attention, monotonic attention, and a new variant called MoChA.
Presenting BiT, an open-source approach for large-scale pre-training of models covering a wide range of visual tasks, which highlights the importance of choices in the model architecture for downstream performance. Learn all about it below:
#neurips
tips day 5 (h/t
@chris_j_beckham
)! Conferences are a parade of successes. Remember that for every impressive paper there are many (unpublished) ideas that didn't pan out. Take this opportunity to ask people about negative results!
The
#ICLR2021
Workshop on Enormous Language Models (WELM) is tomorrow, May 7th!
Full info:
Livestream:
gathertown info for ICLR registrants:
Thread summarizing the talks & panels ⬇️ (1/14)
Today, the T5 team competed against T5 in a "pub quiz" on (context-free) questions from the TriviaQA/NQ validation sets. We LOST! We only got 20% right; T5 got 35%. To see how to fine-tune T5 on context-free QA (or any other task) with a free TPU, check out our Colab tutorial ⬇️
As promised, we have made the Text-To-Text Transfer Transformer (T5) models much easier to fine-tune for new tasks, and we just released a Colab notebook where you can try it yourself on a free TPU!
👇
(1/3)
I actually encourage my students & colleagues to get on Twitter, because (for better or worse) it remains the best place to find out about new papers. Most of the time, I only check a filtered version of my timeline that only shows tweets with a link. 🤷
New work w/
@yaoqinucsd
, Nicholas Carlini,
@goodfellow_ian
, and Gary Cottrell on generating imperceptible, robust, and targeted adversarial examples for speech recognition systems!
Paper:
Audio samples:
Google has open-sourced Lingvo, which is the excellent codebase we used for the Monotonic (Chunkwise) Attention papers! Has also been used in dozens of other Brain papers. Code: Pre-print:
Protip: It is not too late to apply to start a PhD in Fall 2020 at the UNC CS department! The deadline for applications is, amazingly, not until March 10th.
TIL that ICLR is the
#1
conference in "Artificial Intelligence" according to Google Scholar Metrics () but it's still not included in . All rankings are silly and arbitrary, but this seems especially silly and especially arbitrary.
2) Skim
You'll find that many papers within your subfield of choice have a lot in common - there is often only a small nugget of novelty in each paper. It's incredibly important to develop your ability to find this nugget as quickly as possible. (3/5)
I finally put up the slides for my faculty job talk from last year:
They are now pretty out-of-date but I spent a ton of time making them fancy and clear. Includes overviews of a few frameworks for deep generative modeling, +MoChA/MILk, MusicVAE, and ACAI.
The camera ready version of "Realistic Evaluation of Deep Semi-Supervised Learning Algorithms" is now up on arxiv: Includes an entire bonus page, two new tables, a new figure, and a couple of new experiments!
PSA: If a paper on a generative model of images only presents results on MNIST/SVHN/CelebA, you should be skeptical that it will work in general. These datasets are extremely regular - they are normalized so that objects tend to appear in the same location/orientation.
We're having a *debate* at the Transfer Learning for NLP workshop
@NeurIPSConf
this year.
@kchonyc
is one of our debaters; the other one can't make it to NeurIPS anymore 😢 Who do you want to see go toe-to-toe with Cho?