Tristan Thrush

@TristanThrush

2,717 Followers · 765 Following · 49 Media · 452 Statuses

PhD-ing @StanfordAILab @stanfordnlp. Advisor @PlaytestAI. Past: @ContextualAI, @huggingface, @Meta FAIR, @mitbrainandcog, @MIT_CSAIL, @NASAJPL

Joined April 2021
Pinned Tweet
@TristanThrush
Tristan Thrush
4 months
📢 New paper!! Do LLMs understand self-referential statements? Introducing “I am a Strange Dataset”. All tested models perform around chance at our metalinguistic self-reference task. GPT-4 is the only model significantly above chance on all tests, but only slightly. 🧵
@TristanThrush
Tristan Thrush
11 months
GPT-4 after taking MIT tests and self-scoring its answers as 100%
@TristanThrush
Tristan Thrush
1 year
I'm excited to announce that we've added the very first @OpenAI human-feedback dataset to the Hugging Face Hub! Check it out if you have interest in #ChatGPT and Reinforcement Learning from Human Feedback. The dataset is from the awesome WebGPT paper.
@TristanThrush
Tristan Thrush
1 year
Life update: I’ve decided to join Stanford as a PhD student. Beyond happy for the chance to collaborate closely with the incredible researchers in the NLP group and broader AI lab!!!
@TristanThrush
Tristan Thrush
2 years
We’re going to do it! We’ll train and release masked and causal language models (e.g. BERT & GPT-2) on new Common Crawl snapshots as they come out! We call this project Online Language Modeling (OLM). What applications or research questions can we enable or help answer? A 🧵:
@TristanThrush
Tristan Thrush
1 year
Super excited to say that our Online Language Model project has reached a huge milestone! We are now releasing a RoBERTa/BERT and GPT2 trained on up-to-date data, every month or so. But how do they do on standard benchmarks? Typically, better than the originals! A 🧵…
@TristanThrush
Tristan Thrush
2 years
For our Online Language Modelling (OLM) project, we’ve open-sourced end-to-end code to turn the latest Common Crawl and Wikipedia web snapshots into clean datasets for pretraining models like BERT and GPT-2: . What are the details? A 🧵:
@TristanThrush
Tristan Thrush
2 years
Happy to announce our new CVPR paper - Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality. All tested SOTA multimodal models perform very poorly on our new vision-language eval dataset. Paper: #CVPR2022 , #NLProc 1/5
@TristanThrush
Tristan Thrush
1 year
A new @OpenAI human-feedback dataset is on the Hugging Face Hub! This one is from the "Learning to Summarize from Human Feedback" paper, where the amazing authors trained an RLHF reward model for summarization.
@TristanThrush
Tristan Thrush
2 years
We released initial pretraining datasets for the Online Language Modelling project, where we will train language models on each new Common Crawl snapshot. But what secrets await us in these random-ish internet samples 🕵️? What differs between monthly snapshots? A 🧵:
@TristanThrush
Tristan Thrush
1 year
Another exciting release in the Online Language Modelling project. Our December 2022 RoBERTa/BERT and GPT2 are out! Like the previous models, they do better on standard benchmarks than the originals. Dec RoBERTa/BERT: Dec GPT2:
@TristanThrush
Tristan Thrush
1 year
We're doing it! Started training a bert-base-uncased last night on the latest OLM data. It's not close to completion yet (hasn't even completed an epoch), but here's an initial result. More models coming soon, including causal LMs!
@TristanThrush
Tristan Thrush
1 year
Podcast episode is out! We discuss:
- Research life at MIT vs. Facebook/Meta AI vs. Hugging Face (academia vs. large corporation vs. startup)
- How we know that #AIArt systems don't understand word order
- Where #ChatGPT fails
- When making a model bigger makes it worse
@CoexistWithAI
Coexisting with AI
1 year
It's the episode you've been waiting for! Hugging Face Engineer Tristan Thrush is here to talk about how AI ability is *not* close to human ability. Simplecast: #AiResearch #AiLies #AiChallenges #artificialIntelligence #AI #machineLearning #ML #podcast
@TristanThrush
Tristan Thrush
2 years
Stop by the Winoground poster at #CVPR today. @apsdehal and I will be there in person! SOTA models (FLAVA, CLIP, etc.) do below chance on our dataset, and authors of closed models like Imagen are too scared to report results. How do we create V&L models w compositional knowledge?
@TristanThrush
Tristan Thrush
6 months
The year is 2028. The Feds have infiltrated Soumith Chintala's secret GPU bunker in Montana that he was using to push open-source weights. GPT-7 align-o-matic™ drones have just found Yann LeCun's cave dwelling. There is no more hope 😢
@soumithchintala
Soumith Chintala
6 months
In 270 days, the Department of Commerce will determine whether they will allow open-weights or not. If you support open model weights and want something actionable to do, then figure out how to lobby your opinion to them.
@TristanThrush
Tristan Thrush
2 years
New ACL 2022 System Demo paper! It used to take a lot of technical effort to set up custom AI tasks, evaluate models, and collect crowdworker data with models in-the-loop. We’ve added a new framework to @DynabenchAI that aims to help: Dynatask. 1/3
@TristanThrush
Tristan Thrush
5 months
@cunha_tristan @pfau No way, my parents named me Tristan Thrush, without realizing that "Tristan Thrush" is the exact name of a bird that lives in Tristan da Cunha. This is absolutely wild.
@TristanThrush
Tristan Thrush
2 years
BERT and GPT-2 are downloaded a whopping 30 million times every month on the @huggingface Hub, but they live in the past. They both think that Obama is still president and have never heard of COVID! 😱
@TristanThrush
Tristan Thrush
8 months
According to the statistics I've kept since moving into Stanford last night, 100% of Stanford students are business students. Or maybe it's just that the business students are the only ones who are willing to make eye contact and introduce themselves to me 😅
@TristanThrush
Tristan Thrush
4 months
Will write up a thread on our metalinguistic self-reference tests for LLMs tomorrow morning. Until then, enjoy this LLaMA looking at itself in a mirror. Does it understand?
@Teknium1
Teknium (e/λ)
4 months
I was surprised to learn there is still alpha in browsing arxiv for the latest papers yourself rather than waiting for twitter to surface the good ones. Here's a couple released today that looked interesting: I am a Strange Dataset: Metalinguistic Tests for Language Models…
@TristanThrush
Tristan Thrush
3 months
📢 New multimodal benchmark and results on a simple (yet not trivial!) task. We hope that this dataset will be useful in AI art, visual language modeling, multimodal retrieval, and possibly even mechanistic interpretability.
@top34051
Top Burapacheep
3 months
📢 New short paper preprint with a new multimodal benchmark: ColorSwap!! Models have issues with color and word order compositionality. It is important #AIArt models get it right. I asked for a blue orange and an orange blueberry! Not an orange orange and a blue blueberry! 🧵
@TristanThrush
Tristan Thrush
1 year
I’ve also made the difficult decision to leave the amazing team at Hugging Face. I will take the time between now and the official start of my PhD to travel, exercise, think deeply about research ideas, reach out to potential collaborators, and get some new things started!
@TristanThrush
Tristan Thrush
1 year
Great news! The Online Language Modelling project has its first model trained by a community member: the amazing @Muhtasham9 ! It's a TinyBERT trained on the OLM December 2022 pretraining dataset. Want a tiny and up-to-date language model? Check it out:
@TristanThrush
Tristan Thrush
2 years
What do people want more of this fall? According to our internet snapshots, it's drugs 🌿, money 🤑, and romance 😍 (in that order!). We carefully examined internet snapshots for our Online Language Model project (models coming soon!). What are the findings? A 🧵:
@TristanThrush
Tristan Thrush
11 months
Woah Bing Chat can understand images now! I was giddy with excitement to try some Winoground images on it! I was super surprised to find that even Bing Chat doesn't get the yes/no right, and sometimes doesn't understand at all. 🤯 Maybe it isn't actually using multimodal GPT-4?
@TristanThrush
Tristan Thrush
5 months
Winoground is a simple multimodal eval that requires an understanding of word order. It’s been out for nearly 2 years. Surely GPT4V can do it now, right? Wrong! It is the best known model, but GPT4V still only gets about 38% on the main metric!! 🙌 @ChengleiSi @aryaman2020
@ChengleiSi
CLS
5 months
I saw debates on whether GPT-4V can “solve” compositionality, so I spent my precious Friday afternoon benchmarking it on Winoground. Tldr: NO it’s still far from solved (GPT-4V 38.0% vs PaLI 28.8% vs MTurk Humans 85.5%). Colab w/ all results: 🧵(1/n)
@TristanThrush
Tristan Thrush
1 month
Are you not reporting confidence intervals? Then you're part of a PsyOp by the PhD student hivemind in order to publish tons of papers that don't matter.
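One lightweight way to put a confidence interval on a benchmark accuracy is the normal-approximation binomial interval. A minimal sketch (illustrative only; the function name is made up here, and a Wilson or bootstrap interval is often preferable for small samples):

```python
import math

def binomial_ci(correct: int, total: int, z: float = 1.96):
    """Approximate 95% CI for an accuracy of correct/total.

    Uses the normal approximation to the binomial; z=1.96
    corresponds to the 95% confidence level.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] since accuracy cannot leave that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)

# e.g. 59 correct out of 100: the interval straddles the 50% chance
# level, so "above chance" would not be a safe claim for this sample.
lo, hi = binomial_ci(59, 100)
```

If the lower bound clears the chance level, a claim of "significantly above chance" is at least plausible at that confidence level; if it does not, the result is indistinguishable from guessing.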
@TristanThrush
Tristan Thrush
6 months
Whosoever holds the Radford possesses the power of AGI
@beenwrekt
Ben Recht
6 months
A bunch of people I trust and respect think the significant, valuable contributions of OpenAI are 95% due to a single person. Why doesn't MSFT just pay that person a lot of money and give them infinite Azure credits? Problems solved.
@TristanThrush
Tristan Thrush
4 months
For most works, I would actually prefer to submit to arxiv over ACL if I can't make the anonymity deadline. I might actually do that in the coming days, we will see!
@zacharylipton
Zachary Lipton
4 months
Just learned despite everyone voting down *CL's 🤡-y arxiv embargo policy, it's still firmly in place for ACL 2024. If *CL were a company, the board & leadership wd be fired, the talent wd've left 5 years ago, the common stock wd be worth $0, & WSB wd be taking an interest.
@TristanThrush
Tristan Thrush
1 year
Several mentors had a really profound impact on my research career. It takes a village. @douwekiela , @adinamwilliams , @roger_p_levy , @candacerossio , @renauddetry , @mmitchell_ai , Patrick Winston, Russ Tedrake, Josh Tenenbaum, plus many more. Thank you so much!
@TristanThrush
Tristan Thrush
6 months
A qualification test for a job that is essentially sampled i.i.d. from the actual job. I've never understood why this isn't the de-facto way to interview everywhere.
@__tinygrad__
the tiny corp
6 months
Want a job at tiny corp? Join the discord, get a PR merged, solve a bounty, 12 week internship, full time employee. No resumes, phone screens, whiteboard coding, hackerrank, references, etc… Just a demonstration of skill and motivation.
@TristanThrush
Tristan Thrush
2 years
Of course, we’ll open source everything we do and make our tools available to others. Wanna get involved? Reach out, and join the discord for updates:
@TristanThrush
Tristan Thrush
4 months
On the other hand, people find this task easy. Human annotators from Amazon Mechanical Turk got 89-93% depending on the metric. Unlike models, humans reliably know what is going on when this tweet says that it has three sentences.
@TristanThrush
Tristan Thrush
4 months
Each example in the dataset consists of two self-referential statements that begin in the same way but have different endings. One is true and one is false. Crucially, the ending flips the truth value of the statement.
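A hypothetical pair in that shape (invented here for illustration; not an actual example from the dataset):

```python
# Two self-referential statements sharing a prefix; only the ending
# differs, and the ending flips the truth value.
# Hypothetical example, not taken from "I am a Strange Dataset".
prefix = "This sentence contains exactly"
true_statement = prefix + " six words."    # really is six words long
false_statement = prefix + " seven words." # still six words, so false

assert len(true_statement.split()) == 6
assert len(false_statement.split()) == 6  # the claim of "seven" is false
```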
@TristanThrush
Tristan Thrush
2 years
Excited about the new Evaluation-on-the-Hub tool on Hugging Face 🤩? Sad that you couldn’t filter leaderboards by task 🥺? Now you can 🤯! We’ve released a new feature that allows you see leaderboards for a selected task. Check it out:
@TristanThrush
Tristan Thrush
4 months
It turns out I actually could've enjoyed the holidays 😭
@aclmeeting
ACL 2024
4 months
ACL announcement: "The ACL Executive Committee has voted to significantly change ACL's approach to protecting anonymous peer review. The change is effective immediately." (1/4) #NLProc
@TristanThrush
Tristan Thrush
2 years
How could models stay ⏱️up to date? With so many downstream models standing on the shoulders of these two giants, it’s not easy to change the status quo. How do we capture gradual meaning change + abrupt fact change? These are interesting and poorly understood research questions.
@TristanThrush
Tristan Thrush
5 months
The sunflower I'm growing will never understand how pretty it is. I wonder what kinds of things might look at us and feel the same way?
@TristanThrush
Tristan Thrush
1 year
@shaily99 Yes I don't think I need a PhD to do NLP research. But have you seen all of the cool people at Stanford?
@TristanThrush
Tristan Thrush
1 year
Stoked to share this podcast episode teaser. We talk about some of the most exciting issues to solve in the next generation of AI:
- #AIArt systems don't understand word order
- #ChatGPT doesn't know who the president is and makes stuff up
- Making models bigger can make them worse
@CoexistWithAI
Coexisting with AI
1 year
Our next episode will change how you think about the future of AI. Tristan Thrush, a booming innovative force in AI research, will be talking about the critical AI challenges. #AiLies #AiResearch #AiChallenges #artificialIntelligence #machineLearning
@TristanThrush
Tristan Thrush
2 years
As we continually train new models, a slow form of reinforcement learning may emerge. What actions can we take to help the models improve over time? How do we ensure that the models remember/forget the right things? Can we exploit any concepts from RL research?
@TristanThrush
Tristan Thrush
4 months
We tested open source models from 7B to 70B parameters, including LLaMA 2, Mistral, and Mixtral. We also tested leading API models such as GPT-4 and Claude 2. They all got around chance (50%). GPT-4 is the only one to stay significantly above chance, but not by much (~59-66%).
@TristanThrush
Tristan Thrush
8 months
Let me know when one of these models can beat Winoground, I'm tired of watching all of the big new releases still fail. None of them can even understand word order afaik 🥱😴
@DrJimFan
Jim Fan
8 months
I think DALL·E 3 is not just a stance against MidJourney. It's actually a sneak peek of the upcoming, epic battle of massively multimodal LLMs, against DeepMind Gemini. Quote: "DALL·E 3 is built natively on ChatGPT". This is the key phrase. DALL·E 3's extraordinary language…
@TristanThrush
Tristan Thrush
4 months
LLMs have had issues with negation for a long time too. It seems that multimodal LLMs have similar problems but are a few years behind their purely-language counterparts. Same with word-order, etc.
@FrankyBallarani
Franky Ballarani — e/acc
4 months
@GaryMarcus Def not c3po
@TristanThrush
Tristan Thrush
4 months
In fact, LLMs still don’t reliably know what’s going on when this tweet says that the previous tweet has three sentences. Most are around chance at non-self-referential metalinguistic problems too. Although GPT-4 seems to struggle more with the self-referential framing.
@TristanThrush
Tristan Thrush
6 months
My new favorite test for diffusion models is whether they can generate an image of an orange blueberry riding a blue orange. Horse riding astronaut is so last month.
@TristanThrush
Tristan Thrush
1 year
Thanks to all of the authors of WebGPT and to @natolambert for reaching out to them and organizing!
@TristanThrush
Tristan Thrush
4 months
We found a trend of improvement with scale, but all of the models are still extremely limited. Will this trend continue? How much scale do we need to generalize correctly on metalinguistic self-reference?
@TristanThrush
Tristan Thrush
11 months
Check it out! It turns out that you can just give an LLM captions to get competitive multimodal task performance - it's even better than OpenFlamingo V2 in some cases. Although far from perfect, this is a very strong baseline model!
@w33lliam
William Berrios
11 months
Announcing LENS 🔎, a framework for vision-augmented language models. - Outperforms Flamingo by 9% (56->65%) on VQAv2 - Eliminates the additional cost of multimodal pre-training Demo: Blog+Paper+Code: A 🧵 [1/N]
@TristanThrush
Tristan Thrush
9 months
Except of course if we added Winoground to this plot, it would basically look like a flat line around random performance, and it's been that way for a year (sorry I couldn't resist the plug 😅)
@TristanThrush
Tristan Thrush
11 months
Finally able to talk about how excited I am to help with this!!!
@douwekiela
Douwe Kiela
11 months
Super excited to announce that @apsdehal and I have launched a new company: @ContextualAI ! Why did we start it? Because LLMs are going to radically change the way enterprises operate, and we see a huge need for LLMs that actually work for enterprise use cases. 1/5
@TristanThrush
Tristan Thrush
4 months
@natolambert Thanks for the credit 🤣🤣🤣. I just want to chime in and let people know that I have no affiliation with "Waifu Research Department". I just saw it on HF one day when looking for diffusion finetuning examples 💀
@TristanThrush
Tristan Thrush
8 months
TLDR: He posted in a comment below that these images are cherry-picked and DALLE-3 actually doesn't solve this problem reliably. Still cool images though! Does anyone from an AI art company want to try some new (but simple) training ideas w me? Medium-risk high-reward imo.
@willdepue
will depue
8 months
DALLE-3 solves “horse riding astronaut” prompt for everyone that’s asking.
@TristanThrush
Tristan Thrush
4 months
@BBacktesting In my view, if the tokenizer is the problem, then that's interesting too, right? For whatever reason, humans do well and models don't, given the same string. So this test might show that changing the tokenizer is important for reaching human-like generalization.
@TristanThrush
Tristan Thrush
2 years
To explore this together with the community, we can start by pretraining a model from scratch every time a Common Crawl snapshot comes out, or continuously keep pretraining the same model. But how do we weight the data? And what else should we try?
@TristanThrush
Tristan Thrush
4 months
We introduce several metrics for automatic evaluation. We test models both for their ability to generate true self-referential statements, and validate complete self-referential statements as true or false.
@TristanThrush
Tristan Thrush
3 months
Not sure if this is actually cause and effect. But the stock price maps to my personal confidence pretty well. I left Meta FAIR earlier than I had expected right after they started the Metaverse focus. I lost some faith during that time, but now my faith is very much back!!
@BrianRoemmele
Brian Roemmele
3 months
How did open source AI change Meta’s stock value? Welp I think this may speak loudly. Meta is many things but one thing history will record is they saved AI to be open source. The markets agree:
@TristanThrush
Tristan Thrush
1 year
What are the implementation details? We’ve open-sourced everything, and we hope you find it easy to use! You can use our tools to pull the latest data from across the web, clean it, and pretrain models. Data 👉 Training 👉
@TristanThrush
Tristan Thrush
2 months
@JesseDodge I can think of two ways this could be fine off the top of my head:
1. Human-preferred AI content is naturally upweighted on the internet, so human input remains.
2. True/useful generations are naturally upweighted on the internet (because AI-generated code didn't crash, etc.)
@TristanThrush
Tristan Thrush
10 months
@WenhuChen If only first author, then I don't think that even Karpathy meets this qualification 😂
@TristanThrush
Tristan Thrush
1 year
@0x_y4c0 @OpenAI I think that this blog post is a great intro to RLHF:
@TristanThrush
Tristan Thrush
2 years
Why is it important? BERT and GPT-2 both live in the past. They think that Obama is still president and have never heard of COVID! To fix this, we need a pretraining dataset that continuously updates.
@TristanThrush
Tristan Thrush
2 years
The task: Given two images and two captions, the goal is to match them correctly—but crucially, both captions contain the same words/morphemes, only in a different order. Because the captions contain identical words, bag-of-words (BOW) models cannot perform above chance. 2/5
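The matching setup above is typically scored with pairwise "text", "image", and "group" accuracies for each example. A minimal sketch of those checks (illustrative only; not the official Winoground evaluation code):

```python
# s[i][j]: a model's similarity score between caption i and image j.
# The correct pairing is caption 0 <-> image 0 and caption 1 <-> image 1.

def text_score(s) -> bool:
    # For each image, the matching caption must be preferred.
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]

def image_score(s) -> bool:
    # For each caption, the matching image must be preferred.
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]

def group_score(s) -> bool:
    # Both directions must be right for the example to count.
    return text_score(s) and image_score(s)
```

Because the two captions share identical words, a bag-of-words scorer assigns both captions the same score for a given image, so neither strict inequality can hold and it cannot beat chance on these metrics.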
@TristanThrush
Tristan Thrush
1 year
@bwhite5290 A few directions: 1. How do we get V&L models to beat Winoground? 2. How do we keep our models up-to-date? It would be nice if we could just tell ChatGPT "remember that x is president now". 3. How do we bring large-scale pre-training to the real world, with e.g. robots?
@TristanThrush
Tristan Thrush
1 year
Overall: If you want to use a more up-to-date BERT, go here 👉 If you want to use a more up-to-date GPT2, go here 👉 Stay tuned for the next models, which will be trained with December data!
@TristanThrush
Tristan Thrush
7 months
For some reason, I really like Mistral's pixelated "M" logo. The design of French AI company logos continues to be 10/10
@TristanThrush
Tristan Thrush
2 years
Also curious about this @GoogleAI . Does your model stand a chance against Winoground?
@GaryMarcus
Gary Marcus
2 years
Dear @GoogleAI , Three months have passed since you claimed Imagen improved performance on compositionality. I asked for access and you didn’t respond. @TristanThrush offered you his Winoground materials; you didn’t respond. Why not? cc @Chitwan_Saharia @blaiseaguera
@TristanThrush
Tristan Thrush
2 years
New pretraining dataset, this time from a May 2017 snapshot of the internet: It is easy to run our pipeline on any Common Crawl snapshot, and the community has expressed interest in comparing text from years ago with the text in our latest datasets.
@TristanThrush
Tristan Thrush
1 year
Looks like we are still in dire need of Online Language Models after trying out #ChatGPT ! Hopefully our project will lead to insights about how we can update even the largest of models, like this one, effectively and efficiently with new information.
@TristanThrush
Tristan Thrush
4 months
@daprofile @ESYudkowsky Very well, crowdworkers are at ~90% with essentially the same prompts that we give the models
@TristanThrush
Tristan Thrush
2 months
I bet it isn't even aware of its own outputs
@tegmark
Max Tegmark
2 months
To what extent is the new Claude3 AI self-aware?
@TristanThrush
Tristan Thrush
2 years
@GaryMarcus @raphaelmilliere The model can be run on the prompts from Winoground right now, @Chitwan_Saharia . The dataset is available on Hugging Face:
@TristanThrush
Tristan Thrush
2 years
Today, anyone can select models, datasets, and metrics on the Hugging Face hub and get the evaluation results automatically! Very important feature for practitioners to choose models, for researchers to test a dataset on lots of models, and for reproducibility efforts!
@_lewtun
Lewis Tunstall
2 years
Excited to share a new tool we’ve built called Evaluation on the Hub 🔥🔥🔥! With this tool you can evaluate any model on any dataset with any metric🤯 Evaluate your models here👉 Let’s take a look at how it works 🧵 1/
@TristanThrush
Tristan Thrush
2 years
Our 2022 summer Common Crawl OLM datasets are right here, and we invite you to do your own analysis and tell us what you find! August: June/July: May:
@TristanThrush
Tristan Thrush
1 year
Woah excited to see Winoground in the top 100 most cited AI papers of 2022!!
@ylecun
Yann LeCun
1 year
At @MetaAI we favor publication quality over quantity. That's why among the 100 most cited AI papers in 2022, @MetaAI has authored (or co-authored) 16, ranking 2nd just behind Google with 22. Our research is having a large impact on the community. (and NYU ranks nicely, too).
@TristanThrush
Tristan Thrush
5 months
@giffmana @ChengleiSi The PaLI numbers actually come from this paper which is authored by people from Google. They even finetuned it on additional data, etc. for an extra Winoground edge, I think:
@TristanThrush
Tristan Thrush
9 months
Benchmarks continue to saturate even more quickly 🤯.
@douwekiela
Douwe Kiela
9 months
Progress in AI continues to outpace benchmarks. Check out this new plot, inspired by @DynabenchAI , that shows just how quickly it's happening. Read more about it here:
@TristanThrush
Tristan Thrush
2 months
@khoomeik This is a really cool direction - I'm glad I get to say that I knew you before you became one of the godfathers of scaling :D
@TristanThrush
Tristan Thrush
2 years
New words! It turns out that the world changes a lot in a few months. In our pretraining datasets, we found a reflection of new events that occurred and were amplified over the summer. In the graph below, you can see talk about these terms increasing throughout the summer.
@TristanThrush
Tristan Thrush
1 year
Thank you for all of your leadership on this. Very important paper and I am happy that I got to contribute a bit!
@mmitchell_ai
MMitchell
1 year
I am bursting with excitement to finally share an idea that has been cooking for awhile: Measuring Data When you "measure data", you quantify its characteristics to support dataset comparison & curation. You also begin to know what systems will learn.
@TristanThrush
Tristan Thrush
2 months
@GaryMarcus There's a benchmark for this stuff!
@TristanThrush
Tristan Thrush
2 years
We found that all of these models are very poor overall: FLAVA, CLIP, UNITER, ViLLA, VinVL, VisualBERT, ViLT, LXMERT, ViLBERT, UniT, VSE++, and VSRN. Can your model do better? 3/5
@TristanThrush
Tristan Thrush
1 year
Can we know *for sure* that any of these models' generations are truly novel, without manually inspecting every last training image? These models could be even less compositional than we thought ()
@Eric_Wallace_
Eric Wallace
1 year
Models such as Stable Diffusion are trained on copyrighted, trademarked, private, and sensitive images. Yet, our new paper shows that diffusion models memorize images from their training data and emit them at generation time. Paper: 👇[1/9]
@TristanThrush
Tristan Thrush
4 months
@JulieKallini
Julie Kallini ✨
5 months
ChatGPT: Sorry, I can't draw copyrighted characters like Sonic the Hedgehog. Also ChatGPT: Wow, Sonic the Hedgehog sounds like a fun and original character!
@TristanThrush
Tristan Thrush
1 year
Why is this project important? BERT and GPT2 are still two of the most downloaded models on the Hugging Face Hub, but they have no idea what COVID is or who the current president is. We can take a step towards fixing this by re-training them on new data continuously.
@TristanThrush
Tristan Thrush
2 years
@max_nlp @GaryMarcus @raphaelmilliere Yes, it can be run on Winoground prompts and evaluated with annotators in a way that is similar to what they did with DrawBench. @Chitwan_Saharia and coauthors can reach out if they need help understanding the setup. Winoground is available for them to use whenever they want.
@TristanThrush
Tristan Thrush
2 years
The dataset was hand-curated by a group of expert annotators and validated by crowdworkers. To assist in analyzing model performance, the annotators tagged examples from a set of 70 fine-grained linguistic tags, 5 coarse linguistic tags, and 3 visual tags. 4/5
@TristanThrush
Tristan Thrush
1 year
@CoexistWithAI is a new AI podcast which tries to make the conversations understandable to the general public. Y'all should follow them! Stay tuned for the episode
@TristanThrush
Tristan Thrush
4 months
@ESYudkowsky Do you think that LLMs can truly be self-aware if they aren't even aware of their own outputs?
@TristanThrush
Tristan Thrush
4 months
@aryaman2020 Chomsky's been real quiet since this dropped
@TristanThrush
Tristan Thrush
3 months
@giffmana @natolambert Ok I will do this just for you: if an undergrad wants to take this on, I will give them some advice about how to make a cool waifu-related benchmark
@TristanThrush
Tristan Thrush
1 year
Of course, the original bert-base-uncased doesn't know what's going on.
@TristanThrush
Tristan Thrush
5 months
I like this analogy. Seems very sage, especially coming from an academic.
@ChrisGPotts
Christopher Potts
5 months
There is an episode of Parks & Rec that is actually about AI research. Ron and Chris have a cooking competition. Chris spends all day crafting a fancy custom sandwich. Ron buys hamburger meat from a convenience store. The hamburger wins. Often, in AI, the hamburger wins.
@TristanThrush
Tristan Thrush
11 months
@SashaMTL @emilymbender @annargrs @alkoller To see that it isn't solved, an example that I like to use is this: Ask a diffusion model to generate "two forks and three spoons" versus "three forks and two spoons". The models might get examples like this correct sometimes, but as far as I know, they still aren't reliable.
@TristanThrush
Tristan Thrush
4 months
@giffmana Haha nice I'm honored 🙏
@TristanThrush
Tristan Thrush
2 years
Thanks for the incredible teamwork! @this_is_ryanj , @max_nlp , @apsdehal , @adinamwilliams , @douwekiela , @candacerossio . We hope this dataset and task for visio-linguistic compositional understanding will contribute towards the development of truly grounded multimodal models! 5/5