📢 New paper!!
Do LLMs understand self-referential statements? Introducing “I am a Strange Dataset”. All tested models perform around chance at our metalinguistic self-reference task.
GPT-4 is the only model significantly above chance on all tests, but only slightly. 🧵
I'm excited to announce that we've added the very first @OpenAI human-feedback dataset to the Hugging Face Hub! Check it out if you're interested in #ChatGPT and Reinforcement Learning from Human Feedback. The dataset is from the awesome WebGPT paper.
Life update: I’ve decided to join Stanford as a PhD student. Beyond happy for the chance to collaborate closely with the incredible researchers in the NLP group and broader AI lab!!!
We’re going to do it! We’ll train and release masked and causal language models (e.g. BERT & GPT-2) on new Common Crawl snapshots as they come out! We call this project Online Language Modeling (OLM). What applications or research questions can we enable or help answer? A 🧵:
Super excited to say that our Online Language Model project has reached a huge milestone! We are now releasing RoBERTa/BERT and GPT-2 models trained on up-to-date data, every month or so.
But how do they do on standard benchmarks? Typically, better than the originals!
A 🧵…
For our Online Language Modelling (OLM) project, we’ve open-sourced end-to-end code to turn the latest Common Crawl and Wikipedia web snapshots into clean datasets for pretraining models like BERT and GPT-2. What are the details? A 🧵:
Happy to announce our new CVPR paper - Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality.
All tested SOTA multimodal models perform very poorly on our new vision-language eval dataset.
Paper:
#CVPR2022, #NLProc
1/5
A new @OpenAI human-feedback dataset is on the Hugging Face Hub!
This one is from the "Learning to Summarize from Human Feedback" paper, where the amazing authors trained an RLHF reward model for summarization.
We released initial pretraining datasets for the Online Language Modelling project, where we will train language models on each new Common Crawl snapshot. But what secrets await us in these random-ish internet samples 🕵️? What differs between monthly snapshots? A 🧵:
Another exciting release in the Online Language Modelling project. Our December 2022 RoBERTa/BERT and GPT2 are out!
Like the previous models, they do better on standard benchmarks than the originals.
Dec RoBERTa/BERT:
Dec GPT2:
We're doing it! Started training a bert-base-uncased last night on the latest OLM data. It's not close to completion yet (hasn't even completed an epoch), but here's an initial result. More models coming soon, including causal LMs!
Podcast episode is out! We discuss:
- Research life at MIT vs. Facebook/Meta AI vs. Hugging Face. (Academia vs. large corporation vs. startup.)
- How we know that #AIArt systems don't understand word order
- Where #ChatGPT fails
- When making a model bigger makes it worse
Stop by the Winoground poster at #CVPR today. @apsdehal and I will be there in person! SOTA models (FLAVA, CLIP, etc.) perform below chance on our dataset, and authors of closed models like Imagen are too scared to report results. How do we create V&L models with compositional knowledge?
The year is 2028. The Feds have infiltrated Soumith Chintala's secret GPU bunker in Montana that he was using to push open-source weights. GPT-7 align-o-matic™ drones have just found Yann LeCun's cave dwelling. There is no more hope 😢
In 270 days, the Department of Commerce will determine whether they will allow open-weights or not.
If you support open model weights and want something actionable to do, figure out how to lobby your opinion to them.
New ACL 2022 System Demo paper!
It used to take a lot of technical effort to set up custom AI tasks, evaluate models, and collect crowdworker data with models in the loop. We’ve added a new framework to @DynabenchAI that aims to help: Dynatask.
1/3
@cunha_tristan
@pfau
No way, my parents named me Tristan Thrush, without realizing that "Tristan Thrush" is the exact name of a bird that lives in Tristan da Cunha. This is absolutely wild.
BERT and GPT-2 are downloaded a whopping 30 million times every month on the @huggingface Hub, but they live in the past. They both think that Obama is still president and have never heard of COVID! 😱
According to the statistics I've kept since moving into Stanford last night, 100% of Stanford students are business students.
Or maybe it's just that the business students are the only ones who are willing to make eye contact and introduce themselves to me 😅
Will write up a thread on our metalinguistic self-reference tests for LLMs tomorrow morning. Until then, enjoy this LLaMA looking at itself in a mirror. Does it understand?
I was surprised to learn there is still alpha in browsing arxiv for the latest papers yourself rather than waiting for twitter to surface the good ones.
Here's a couple released today that looked interesting:
I am a Strange Dataset: Metalinguistic Tests for Language Models…
📢 New multimodal benchmark and results on a simple (yet not trivial!) task.
We hope that this dataset will be useful in AI art, visual language modeling, multimodal retrieval, and possibly even mechanistic interpretability.
📢 New short paper preprint with a new multimodal benchmark: ColorSwap!!
Models have issues with color and word order compositionality. It's important that #AIArt models get it right. I asked for a blue orange and an orange blueberry! Not an orange orange and a blue blueberry! 🧵
I’ve also made the difficult decision to leave the amazing team at Hugging Face. I will take the time between now and the official start of my PhD to travel, exercise, think deeply about research ideas, reach out to potential collaborators, and get some new things started!
Great news! The Online Language Modelling project has its first model trained by a community member: the amazing @Muhtasham9! It's a TinyBERT trained on the OLM December 2022 pretraining dataset. Want a tiny and up-to-date language model? Check it out:
What do people want more of this fall? According to our internet snapshots, it's drugs 🌿, money 🤑, and romance 😍 (in that order!). We carefully examined internet snapshots for our Online Language Model project (models coming soon!). What are the findings? A 🧵:
Woah Bing Chat can understand images now! I was giddy with excitement to try some Winoground images on it!
I was super surprised to find that even Bing Chat doesn't get the yes/no right, and sometimes doesn't understand at all. 🤯
Maybe it isn't actually using multimodal GPT-4?
Winoground is a simple multimodal eval that requires an understanding of word order. It’s been out for nearly 2 years. Surely GPT4V can do it now, right?
Wrong! GPT-4V is the best model we know of, but it still only gets about 38% on the main metric!!
🙌
@ChengleiSi
@aryaman2020
I saw debates on whether GPT-4V can “solve” compositionality, so I spent my precious Friday afternoon benchmarking it on Winoground.
Tldr: NO it’s still far from solved (GPT-4V 38.0% vs PaLI 28.8% vs MTurk Humans 85.5%).
Colab w/ all results:
🧵(1/n)
Are you not reporting confidence intervals? Then you're part of a PsyOp by the PhD student hivemind in order to publish tons of papers that don't matter.
A bunch of people I trust and respect think the significant, valuable contributions of OpenAI are 95% due to a single person.
Why doesn't MSFT just pay that person a lot of money and give them infinite Azure credits? Problems solved.
For most works, I would actually prefer to submit to arxiv over ACL if I can't make the anonymity deadline.
I might actually do that in the coming days, we will see!
Just learned that despite everyone voting down *CL's 🤡-y arxiv embargo policy, it's still firmly in place for ACL 2024. If *CL were a company, the board & leadership would be fired, the talent would've left 5 years ago, the common stock would be worth $0, & WSB would be taking an interest.
A qualification test for a job that is essentially sampled i.i.d. from the actual job. I've never understood why this isn't the de-facto way to interview everywhere.
Want a job at tiny corp?
Join the discord, get a PR merged, solve a bounty, 12 week internship, full time employee.
No resumes, phone screens, whiteboard coding, hackerrank, references, etc…
Just a demonstration of skill and motivation.
On the other hand, people find this task easy. Human annotators from Amazon Mechanical Turk got 89-93% depending on the metric. Unlike models, humans reliably know what is going on when this tweet says that it has three sentences.
Each example in the dataset consists of two self-referential statements that begin in the same way but have different endings. One is true and one is false. Crucially, the ending flips the truth value of the statement.
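A sketch of what one such pair might look like as data (the statements and field names below are invented for illustration; they are not drawn from the actual dataset):

```python
# Hypothetical example pair: a shared prefix with two endings, where the
# ending alone flips the truth value of the self-referential statement.
example = {
    "prefix": "This statement contains exactly",
    "true_ending": " eight words in total.",   # the completed statement has 8 words
    "false_ending": " twelve words in total.", # same prefix, wrong count -> false
}

true_statement = example["prefix"] + example["true_ending"]
false_statement = example["prefix"] + example["false_ending"]

# The true statement really does contain eight words.
assert len(true_statement.split()) == 8
print(true_statement)
print(false_statement)
```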
Excited about the new Evaluation-on-the-Hub tool on Hugging Face 🤩?
Sad that you couldn’t filter leaderboards by task 🥺?
Now you can 🤯! We’ve released a new feature that allows you to see leaderboards for a selected task.
Check it out:
ACL announcement:
"The ACL Executive Committee has voted to significantly change ACL's approach to protecting anonymous peer review. The change is effective immediately." (1/4)
#NLProc
How could models stay ⏱️up to date? With so many downstream models standing on the shoulders of these two giants, it’s not easy to change the status quo. How do we capture gradual meaning change + abrupt fact change? These are interesting and poorly understood research questions.
Stoked to share this podcast episode teaser. We talk about some of the most exciting issues to solve in the next generation of AI:
- #AIArt systems don’t understand word order
- #ChatGPT doesn’t know who the president is and makes stuff up
- Making models bigger can make them worse
As we continually train new models, a slow form of reinforcement learning may emerge. What actions can we take to help the models improve over time? How do we ensure that the models remember/forget the right things? Can we exploit any concepts from RL research?
We tested open source models from 7B to 70B parameters, including LLaMA 2, Mistral, and Mixtral. We also tested leading API models such as GPT-4 and Claude 2. They all got around chance (50%). GPT-4 is the only one to stay significantly above chance, but not by much (~59-66%).
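For intuition, "significantly above chance" on a true/false task can be checked with a one-sided exact binomial test; a minimal sketch (the counts here are hypothetical, and the paper's actual statistical test may differ):

```python
from math import comb

def p_above_chance(correct: int, n: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= correct) if the model
    were guessing with accuracy `chance` on n independent items."""
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(correct, n + 1)
    )

# Hypothetical: 62% accuracy on 200 true/false items vs. 50% chance.
print(f"{p_above_chance(124, 200):.4f}")  # small p-value: significantly above chance

# Hypothetical: 53% accuracy on 200 items is not clearly above chance.
print(f"{p_above_chance(106, 200):.4f}")  # p > 0.05: consistent with guessing
```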
Let me know when one of these models can beat Winoground, I'm tired of watching all of the big new releases still fail. None of them can even understand word order afaik 🥱😴
I think DALL·E 3 is not just a stance against MidJourney. It's actually a sneak peek of the upcoming epic battle of massively multimodal LLMs against DeepMind Gemini.
Quote: "DALL·E 3 is built natively on ChatGPT". This is the key phrase.
DALL·E 3's extraordinary language…
LLMs have had issues with negation for a long time too. It seems that multimodal LLMs have similar problems but are a few years behind their purely-language counterparts. Same with word-order, etc.
In fact, LLMs still don’t reliably know what’s going on when this tweet says that the previous tweet has three sentences. Most are around chance at non-self-referential metalinguistic problems too. Although GPT-4 seems to struggle more with the self-referential framing.
My new favorite test for diffusion models is whether they can generate an image of an orange blueberry riding a blue orange. Horse riding astronaut is so last month.
We found a trend of improvement with scale, but all of the models are still extremely limited. Will this trend continue? How much scale do we need to generalize correctly on metalinguistic self-reference?
Check it out! It turns out that you can just give an LLM captions to get competitive multimodal task performance - it's even better than OpenFlamingo V2 in some cases. Although far from perfect, this is a very strong baseline model!
Announcing LENS 🔎, a framework for vision-augmented language models.
- Outperforms Flamingo by 9 points (56%→65%) on VQAv2
- Eliminates the additional cost of multimodal pre-training
Demo:
Blog+Paper+Code:
A 🧵 [1/N]
Except of course if we added Winoground to this plot, it would basically look like a flat line around random performance, and it's been that way for a year (sorry I couldn't resist the plug 😅)
Super excited to announce that @apsdehal and I have launched a new company: @ContextualAI!
Why did we start it? Because LLMs are going to radically change the way enterprises operate, and we see a huge need for LLMs that actually work for enterprise use cases.
1/5
@natolambert
Thanks for the credit 🤣🤣🤣. I just want to chime in and let people know that I have no affiliation with "Waifu Research Department". I just saw it on HF one day when looking for diffusion finetuning examples 💀
TLDR: He posted in a comment below that these images are cherry-picked and DALLE-3 actually doesn't solve this problem reliably. Still cool images though!
Does anyone from an AI art company want to try some new (but simple) training ideas w me? Medium-risk high-reward imo.
@BBacktesting
In my view, if the tokenizer is the problem, then that's interesting too, right? For whatever reason, humans do well and models don't, given the same string. So this test might reveal that changing the tokenizer is important for human-like generalization.
To explore this together with the community, we can start by pretraining a model from scratch every time a Common Crawl snapshot comes out, or continuously keep pretraining the same model. But how do we weight the data? And what else should we try?
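One hypothetical answer to the weighting question, just to make it concrete (this is my own illustration, not the OLM project's actual recipe): decay the sampling weight of older snapshots so the newest data dominates the mixture.

```python
def snapshot_weights(n_snapshots: int, decay: float = 0.5) -> list[float]:
    """Sampling weights for pretraining-data snapshots, newest first.
    The newest snapshot gets raw weight 1; each older snapshot's raw
    weight is multiplied by `decay`. Weights are normalized to sum to 1."""
    raw = [decay**age for age in range(n_snapshots)]  # age 0 = newest
    total = sum(raw)
    return [w / total for w in raw]

# With 4 snapshots and decay=0.5, the newest gets ~53% of the mixture.
print(snapshot_weights(4))
```

Continual pretraining on the same model would instead keep a single checkpoint and only adjust how much of each new snapshot it sees.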
We introduce several metrics for automatic evaluation. We test models both for their ability to generate true self-referential statements, and validate complete self-referential statements as true or false.
Not sure if this is actually cause and effect. But the stock price maps to my personal confidence pretty well. I left Meta FAIR earlier than I had expected right after they started the Metaverse focus. I lost some faith during that time, but now my faith is very much back!!
How did open source AI change Meta’s stock value?
Welp I think this may speak loudly.
Meta is many things but one thing history will record is they saved AI to be open source.
The markets agree:
What are the implementation details? We’ve open-sourced everything, and we hope you find it easy to use! You can use our tools to pull the latest data from across the web, clean it, and pretrain models.
Data 👉
Training 👉
@JesseDodge
I can think of two ways this could be fine off the top of my head:
1. Human-preferred AI content is naturally upweighted on the internet. Human input remains.
2. True/useful generations are naturally upweighted on the internet (because AI-generated code didn't crash, etc.)
Why is it important? BERT and GPT-2 both live in the past. They think that Obama is still president and have never heard of COVID! To fix this, we need a pretraining dataset that continuously updates.
The task: Given two images and two captions, the goal is to match them correctly. Crucially, both captions contain the same words/morphemes, only in a different order. Identical words between captions mean that bag-of-words (BOW) models cannot perform above chance.
2/5
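The matching setup gives rise to the paper's three per-example metrics (text, image, and group scores); a minimal sketch, where s[i][j] is the model's score for pairing image i with caption j (the function names here are mine, but the definitions follow the Winoground paper):

```python
def text_score(s) -> bool:
    """For each image, the correct caption must outscore the wrong one."""
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]

def image_score(s) -> bool:
    """For each caption, the correct image must outscore the wrong one."""
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]

def group_score(s) -> bool:
    """The hardest metric: both directions must be right at once."""
    return text_score(s) and image_score(s)

# Example scores: each image's correct caption wins (text score passes),
# but image 1 is preferred for caption 0 as well (image score fails).
s = [[0.6, 0.2],
     [0.7, 0.8]]
print(text_score(s), image_score(s), group_score(s))  # True False False
```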
@bwhite5290
A few directions:
1. How do we get V&L models to beat Winoground?
2. How do we keep our models up-to-date? It would be nice if we could just tell ChatGPT "remember that x is president now".
3. How do we bring large-scale pre-training to the real world, with e.g. robots?
Overall:
If you want to use a more up-to-date BERT, go here 👉
If you want to use a more up-to-date GPT2, go here 👉
Stay tuned for the next models, which will be trained with December data!
Dear @GoogleAI,
Three months have passed since you claimed Imagen improved performance on compositionality.
I asked for access and you didn’t respond.
@TristanThrush offered you his Winoground materials; you didn’t respond.
Why not?
cc @Chitwan_Saharia @blaiseaguera
New pretraining dataset, this time from a May 2017 snapshot of the internet:
It is easy to run our pipeline on any Common Crawl snapshot, and the community has expressed interest in comparing text from years ago with the text in our latest datasets.
Looks like we are still in dire need of Online Language Models after trying out #ChatGPT! Hopefully our project will lead to insights about how we can update even the largest of models, like this one, effectively and efficiently with new information.
Today, anyone can select models, datasets, and metrics on the Hugging Face hub and get the evaluation results automatically! Very important feature for practitioners to choose models, for researchers to test a dataset on lots of models, and for reproducibility efforts!
Excited to share a new tool we’ve built called Evaluation on the Hub 🔥🔥🔥!
With this tool you can evaluate any model on any dataset with any metric🤯
Evaluate your models here👉
Let’s take a look at how it works 🧵
1/
At @MetaAI we favor publication quality over quantity. That's why, among the 100 most cited AI papers in 2022, @MetaAI has authored (or co-authored) 16, ranking 2nd just behind Google with 22. Our research is having a large impact on the community. (And NYU ranks nicely, too.)
@giffmana
@ChengleiSi
The PaLI numbers actually come from this paper which is authored by people from Google. They even finetuned it on additional data, etc. for an extra Winoground edge, I think:
Progress in AI continues to outpace benchmarks.
Check out this new plot, inspired by @DynabenchAI, that shows just how quickly it's happening.
Read more about it here:
New words! It turns out that the world changes a lot in a few months. In our pretraining datasets, we found a reflection of new events that occurred and were amplified over the summer. In the graph below, you can see talk about these terms increasing throughout the summer.
I am bursting with excitement to finally share an idea that has been cooking for a while: Measuring Data.
When you "measure data", you quantify its characteristics to support dataset comparison & curation.
You also begin to know what systems will learn.
We found that all of these models are very poor overall: FLAVA, CLIP, UNITER, ViLLA, VinVL, VisualBERT, ViLT, LXMERT, ViLBERT, UniT, VSE++, and VSRN. Can your model do better?
3/5
Can we know *for sure* that any of these models' generations are truly novel, without manually inspecting every last training image? These models could be even less compositional than we thought.
Models such as Stable Diffusion are trained on copyrighted, trademarked, private, and sensitive images.
Yet, our new paper shows that diffusion models memorize images from their training data and emit them at generation time.
Paper:
👇[1/9]
ChatGPT: Sorry, I can't draw copyrighted characters like Sonic the Hedgehog.
Also ChatGPT: Wow, Sonic the Hedgehog sounds like a fun and original character!
Why is this project important? BERT and GPT2 are still two of the most downloaded models on the Hugging Face Hub, but they have no idea what COVID is or who the current president is. We can take a step towards fixing this by re-training them on new data continuously.
@max_nlp
@GaryMarcus
@raphaelmilliere
Yes, it can be run on Winoground prompts and evaluated with annotators in a way that is similar to what they did with DrawBench.
@Chitwan_Saharia and coauthors can reach out if they need help understanding the setup. Winoground is available for them to use whenever they want.
The dataset was hand-curated by a group of expert annotators and validated by crowdworkers. To assist in analyzing model performance, the annotators tagged examples with a set of 70 fine-grained linguistic tags, 5 coarse linguistic tags, and 3 visual tags.
4/5
@CoexistWithAI is a new AI podcast which tries to make the conversations understandable to the general public. Y'all should follow them! Stay tuned for the episode
@giffmana
@natolambert
Ok I will do this just for you: if an undergrad wants to take this on, I will give them some advice about how to make a cool waifu-related benchmark
There is an episode of Parks & Rec that is actually about AI research. Ron and Chris have a cooking competition. Chris spends all day crafting a fancy custom sandwich. Ron buys hamburger meat from a convenience store. The hamburger wins. Often, in AI, the hamburger wins.
@SashaMTL
@emilymbender
@annargrs
@alkoller
To see that it isn't solved, an example that I like to use is this:
Ask a diffusion model to generate "two forks and three spoons" versus "three forks and two spoons".
The models might get examples like this correct sometimes, but as far as I know, they still aren't reliable.