📢 New paper!!
Do LLMs understand self-referential statements? Introducing “I am a Strange Dataset”. All tested models perform around chance at our metalinguistic self-reference task.
GPT-4 is the only model significantly above chance on all tests, but only slightly. 🧵
I'm excited to announce that we've added the very first @OpenAI human-feedback dataset to the Hugging Face Hub! Check it out if you're interested in #ChatGPT and Reinforcement Learning from Human Feedback. The dataset is from the awesome WebGPT paper.
Life update: I’ve decided to join Stanford as a PhD student. Beyond happy for the chance to collaborate closely with the incredible researchers in the NLP group and broader AI lab!!!
We’re going to do it! We’ll train and release masked and causal language models (e.g. BERT & GPT-2) on new Common Crawl snapshots as they come out! We call this project Online Language Modeling (OLM). What applications or research questions can we enable or help answer? A 🧵:
Super excited to say that our Online Language Model project has reached a huge milestone! We are now releasing RoBERTa/BERT and GPT-2 models trained on up-to-date data, every month or so.
But how do they do on standard benchmarks? Typically, better than the originals!
A 🧵…
For our Online Language Modelling (OLM) project, we’ve open-sourced end-to-end code to turn the latest Common Crawl and Wikipedia web snapshots into clean datasets for pretraining models like BERT and GPT-2. What are the details? A 🧵:
Happy to announce our new CVPR paper - Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality.
All tested SOTA multimodal models perform very poorly on our new vision-language eval dataset.
Paper:
#CVPR2022, #NLProc
1/5
A new @OpenAI human-feedback dataset is on the Hugging Face Hub!
This one is from the "Learning to Summarize from Human Feedback" paper, where the amazing authors trained an RLHF reward model for summarization.
We released initial pretraining datasets for the Online Language Modelling project, where we will train language models on each new Common Crawl snapshot. But what secrets await us in these random-ish internet samples 🕵️? What differs between monthly snapshots? A 🧵:
Another exciting release in the Online Language Modelling project. Our December 2022 RoBERTa/BERT and GPT2 are out!
Like the previous models, they do better on standard benchmarks than the originals.
Dec RoBERTa/BERT:
Dec GPT2:
We're doing it! Started training a bert-base-uncased last night on the latest OLM data. It's not close to completion yet (hasn't even completed an epoch), but here's an initial result. More models coming soon, including causal LMs!
Podcast episode is out! We discuss:
- Research life at MIT vs. Facebook/Meta AI vs. Hugging Face. (Academia vs. large corporation vs. startup.)
- How we know that #AIArt systems don't understand word order
- Where #ChatGPT fails
- When making a model bigger makes it worse
Stop by the Winoground poster at #CVPR today. @apsdehal and I will be there in person! SOTA models (FLAVA, CLIP, etc.) perform below chance on our dataset, and authors of closed models like Imagen are too scared to report results. How do we create V&L models with compositional knowledge?
The year is 2028. The Feds have infiltrated Soumith Chintala's secret GPU bunker in Montana that he was using to push open-source weights. GPT-7 align-o-matic™ drones have just found Yann LeCun's cave dwelling. There is no more hope 😢
In 270 days, the Department of Commerce will determine whether they will allow open-weights or not.
If you support open model weights and want something actionable to do, figure out how to lobby your opinion to them.
New ACL 2022 System Demo paper!
It used to take a lot of technical effort to set up custom AI tasks, evaluate models, and collect crowdworker data with models in the loop. We’ve added a new framework to @DynabenchAI that aims to help: Dynatask.
1/3
@cunha_tristan
@pfau
No way, my parents named me Tristan Thrush, without realizing that "Tristan Thrush" is the exact name of a bird that lives in Tristan da Cunha. This is absolutely wild.
BERT and GPT-2 are downloaded a whopping 30 million times every month on the @huggingface Hub, but they live in the past. They both think that Obama is still president and have never heard of COVID! 😱
According to the statistics I've kept since moving into Stanford last night, 100% of Stanford students are business students.
Or maybe it's just that the business students are the only ones who are willing to make eye contact and introduce themselves to me 😅
Will write up a thread on our metalinguistic self-reference tests for LLMs tomorrow morning. Until then, enjoy this LLaMA looking at itself in a mirror. Does it understand?
I was surprised to learn there is still alpha in browsing arxiv for the latest papers yourself rather than waiting for twitter to surface the good ones.
Here's a couple released today that looked interesting:
I am a Strange Dataset: Metalinguistic Tests for Language Models…
📢 New multimodal benchmark and results on a simple (yet not trivial!) task.
We hope that this dataset will be useful in AI art, visual language modeling, multimodal retrieval, and possibly even mechanistic interpretability.
📢 New short paper preprint with a new multimodal benchmark: ColorSwap!!
Models have issues with color and word order compositionality. It's important that #AIArt models get it right. I asked for a blue orange and an orange blueberry! Not an orange orange and a blue blueberry! 🧵
I’ve also made the difficult decision to leave the amazing team at Hugging Face. I will take the time between now and the official start of my PhD to travel, exercise, think deeply about research ideas, reach out to potential collaborators, and get some new things started!
Great news! The Online Language Modelling project has its first model trained by a community member: the amazing @Muhtasham9! It's a TinyBERT trained on the OLM December 2022 pretraining dataset. Want a tiny and up-to-date language model? Check it out:
What do people want more of this fall? According to our internet snapshots, it's drugs 🌿, money 🤑, and romance 😍 (in that order!). We carefully examined internet snapshots for our Online Language Model project (models coming soon!). What are the findings? A 🧵:
Woah Bing Chat can understand images now! I was giddy with excitement to try some Winoground images on it!
I was super surprised to find that even Bing Chat doesn't get the yes/no right, and sometimes doesn't understand at all. 🤯
Maybe it isn't actually using multimodal GPT-4?
Winoground is a simple multimodal eval that requires an understanding of word order. It’s been out for nearly 2 years. Surely GPT4V can do it now, right?
Wrong! GPT-4V is the best model we know of, but it still only gets about 38% on the main metric!!
🙌
@ChengleiSi
@aryaman2020
I saw debates on whether GPT-4V can “solve” compositionality, so I spent my precious Friday afternoon benchmarking it on Winoground.
Tldr: NO it’s still far from solved (GPT-4V 38.0% vs PaLI 28.8% vs MTurk Humans 85.5%).
Colab w/ all results:
🧵(1/n)
Are you not reporting confidence intervals? Then you're part of a PsyOp by the PhD student hivemind in order to publish tons of papers that don't matter.
A bunch of people I trust and respect think the significant, valuable contributions of OpenAI are 95% due to a single person.
Why doesn't MSFT just pay that person a lot of money and give them infinite Azure credits? Problems solved.
For most works, I would actually prefer to submit to arxiv over ACL if I can't make the anonymity deadline.
I might actually do that in the coming days, we will see!
Just learned that despite everyone voting down *CL's 🤡-y arxiv embargo policy, it's still firmly in place for ACL 2024. If *CL were a company, the board & leadership would be fired, the talent would've left 5 years ago, the common stock would be worth $0, & WSB would be taking an interest.
A qualification test for a job that is essentially sampled i.i.d. from the actual job. I've never understood why this isn't the de-facto way to interview everywhere.
Want a job at tiny corp?
Join the discord, get a PR merged, solve a bounty, 12 week internship, full time employee.
No resumes, phone screens, whiteboard coding, hackerrank, references, etc…
Just a demonstration of skill and motivation.
On the other hand, people find this task easy. Human annotators from Amazon Mechanical Turk got 89-93% depending on the metric. Unlike models, humans reliably know what is going on when this tweet says that it has three sentences.
Each example in the dataset consists of two self-referential statements that begin in the same way but have different endings. One is true and one is false. Crucially, the ending flips the truth value of the statement.
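A sketch of what one such pair might look like as data (the statements and field names below are invented for illustration; they are not drawn from the actual dataset):

```python
# Hypothetical example pair: a shared prefix with two endings, where the
# ending alone flips the truth value of the self-referential statement.
example = {
    "prefix": "This statement contains exactly",
    "true_ending": " eight words in total.",   # the completed statement has 8 words
    "false_ending": " twelve words in total.", # same prefix, wrong count -> false
}

true_statement = example["prefix"] + example["true_ending"]
false_statement = example["prefix"] + example["false_ending"]

# The true statement really does contain eight words.
assert len(true_statement.split()) == 8
print(true_statement)
print(false_statement)
```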
Excited about the new Evaluation-on-the-Hub tool on Hugging Face 🤩?
Sad that you couldn’t filter leaderboards by task 🥺?
Now you can 🤯! We’ve released a new feature that allows you to see leaderboards for a selected task.
Check it out:
ACL announcement:
"The ACL Executive Committee has voted to significantly change ACL's approach to protecting anonymous peer review. The change is effective immediately." (1/4)
#NLProc
How could models stay ⏱️up to date? With so many downstream models standing on the shoulders of these two giants, it’s not easy to change the status quo. How do we capture gradual meaning change + abrupt fact change? These are interesting and poorly understood research questions.
Stoked to share this podcast episode teaser. We talk about some of the most exciting issues to solve in the next generation of AI:
- #AIArt systems don’t understand word order
- #ChatGPT doesn’t know who the president is and makes stuff up
- Making models bigger can make them worse
As we continually train new models, a slow form of reinforcement learning may emerge. What actions can we take to help the models improve over time? How do we ensure that the models remember/forget the right things? Can we exploit any concepts from RL research?
We tested open source models from 7B to 70B parameters, including LLaMA 2, Mistral, and Mixtral. We also tested leading API models such as GPT-4 and Claude 2. They all got around chance (50%). GPT-4 is the only one to stay significantly above chance, but not by much (~59-66%).
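For intuition, "significantly above chance" on a true/false task can be checked with a one-sided exact binomial test; a minimal sketch (the counts here are hypothetical, and the paper's actual statistical test may differ):

```python
from math import comb

def p_above_chance(correct: int, n: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= correct) if the model
    were guessing with accuracy `chance` on n independent items."""
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(correct, n + 1)
    )

# Hypothetical: 62% accuracy on 200 true/false items vs. 50% chance.
print(f"{p_above_chance(124, 200):.4f}")  # small p-value: significantly above chance

# Hypothetical: 53% accuracy on 200 items is not clearly above chance.
print(f"{p_above_chance(106, 200):.4f}")  # p > 0.05: consistent with guessing
```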
Let me know when one of these models can beat Winoground, I'm tired of watching all of the big new releases still fail. None of them can even understand word order afaik 🥱😴
I think DALL·E 3 is not just a stance against MidJourney. It's actually a sneak peek of the upcoming epic battle of massively multimodal LLMs against DeepMind Gemini.
Quote: "DALL·E 3 is built natively on ChatGPT". This is the key phrase.
DALL·E 3's extraordinary language…
LLMs have had issues with negation for a long time too. It seems that multimodal LLMs have similar problems but are a few years behind their purely-language counterparts. Same with word-order, etc.
In fact, LLMs still don’t reliably know what’s going on when this tweet says that the previous tweet has three sentences. Most are around chance at non-self-referential metalinguistic problems too. Although GPT-4 seems to struggle more with the self-referential framing.
My new favorite test for diffusion models is whether they can generate an image of an orange blueberry riding a blue orange. Horse riding astronaut is so last month.
We found a trend of improvement with scale, but all of the models are still extremely limited. Will this trend continue? How much scale do we need to generalize correctly on metalinguistic self-reference?
Check it out! It turns out that you can just give an LLM captions to get competitive multimodal task performance - it's even better than OpenFlamingo V2 in some cases. Although far from perfect, this is a very strong baseline model!
Announcing LENS 🔎, a framework for vision-augmented language models.
- Outperforms Flamingo by 9 points (56%→65%) on VQAv2
- Eliminates the additional cost of multimodal pre-training
Demo:
Blog+Paper+Code:
A 🧵 [1/N]
Except of course if we added Winoground to this plot, it would basically look like a flat line around random performance, and it's been that way for a year (sorry I couldn't resist the plug 😅)
Super excited to announce that @apsdehal and I have launched a new company: @ContextualAI!
Why did we start it? Because LLMs are going to radically change the way enterprises operate, and we see a huge need for LLMs that actually work for enterprise use cases.
1/5
@natolambert
Thanks for the credit 🤣🤣🤣. I just want to chime in and let people know that I have no affiliation with "Waifu Research Department". I just saw it on HF one day when looking for diffusion finetuning examples 💀
TLDR: He posted in a comment below that these images are cherry-picked and DALLE-3 actually doesn't solve this problem reliably. Still cool images though!
Does anyone from an AI art company want to try some new (but simple) training ideas w me? Medium-risk high-reward imo.
@BBacktesting
In my view, if the tokenizer is the problem, then that's interesting too, right? For whatever reason, humans do well and models don't, given the same string. So this test might reveal that changing the tokenizer is important for human-like generalization.
To explore this together with the community, we can start by pretraining a model from scratch every time a Common Crawl snapshot comes out, or continuously keep pretraining the same model. But how do we weight the data? And what else should we try?
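One hypothetical answer to the weighting question, just to make it concrete (this is my own illustration, not the OLM project's actual recipe): decay the sampling weight of older snapshots so the newest data dominates the mixture.

```python
def snapshot_weights(n_snapshots: int, decay: float = 0.5) -> list[float]:
    """Sampling weights for pretraining-data snapshots, newest first.
    The newest snapshot gets raw weight 1; each older snapshot's raw
    weight is multiplied by `decay`. Weights are normalized to sum to 1."""
    raw = [decay**age for age in range(n_snapshots)]  # age 0 = newest
    total = sum(raw)
    return [w / total for w in raw]

# With 4 snapshots and decay=0.5, the newest gets ~53% of the mixture.
print(snapshot_weights(4))
```

Continual pretraining on the same model would instead keep a single checkpoint and only adjust how much of each new snapshot it sees.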
We introduce several metrics for automatic evaluation. We test models both for their ability to generate true self-referential statements, and validate complete self-referential statements as true or false.
Not sure if this is actually cause and effect. But the stock price maps to my personal confidence pretty well. I left Meta FAIR earlier than I had expected right after they started the Metaverse focus. I lost some faith during that time, but now my faith is very much back!!
How did open source AI change Meta’s stock value?
Welp I think this may speak loudly.
Meta is many things but one thing history will record is they saved AI to be open source.
The markets agree:
What are the implementation details? We’ve open-sourced everything, and we hope you find it easy to use! You can use our tools to pull the latest data from across the web, clean it, and pretrain models.
Data 👉
Training 👉
@JesseDodge
I can think of two ways this could be fine off the top of my head:
1. Human-preferred AI content is naturally upweighted on the internet. Human input remains.
2. True/useful generations are naturally upweighted on the internet (because AI-generated code didn't crash, etc.)
Why is it important? BERT and GPT-2 both live in the past. They think that Obama is still president and have never heard of COVID! To fix this, we need a pretraining dataset that continuously updates.
The task: Given two images and two captions, the goal is to match them correctly. Crucially, both captions contain the same words/morphemes, only in a different order. Identical words between captions mean that bag-of-words (BOW) models cannot perform above chance.
2/5
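The matching setup gives rise to the paper's three per-example metrics (text, image, and group scores); a minimal sketch, where s[i][j] is the model's score for pairing image i with caption j (the function names here are mine, but the definitions follow the Winoground paper):

```python
def text_score(s) -> bool:
    """For each image, the correct caption must outscore the wrong one."""
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]

def image_score(s) -> bool:
    """For each caption, the correct image must outscore the wrong one."""
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]

def group_score(s) -> bool:
    """The hardest metric: both directions must be right at once."""
    return text_score(s) and image_score(s)

# Example scores: each image's correct caption wins (text score passes),
# but image 1 is preferred for caption 0 as well (image score fails).
s = [[0.6, 0.2],
     [0.7, 0.8]]
print(text_score(s), image_score(s), group_score(s))  # True False False
```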
@bwhite5290
A few directions:
1. How do we get V&L models to beat Winoground?
2. How do we keep our models up-to-date? It would be nice if we could just tell ChatGPT "remember that x is president now".
3. How do we bring large-scale pre-training to the real world, with e.g. robots?
Overall:
If you want to use a more up-to-date BERT, go here 👉
If you want to use a more up-to-date GPT2, go here 👉
Stay tuned for the next models, which will be trained with December data!
Dear @GoogleAI,
Three months have passed since you claimed Imagen improved performance on compositionality.
I asked for access and you didn’t respond.
@TristanThrush offered you his Winoground materials; you didn’t respond.
Why not?
cc @Chitwan_Saharia @blaiseaguera
New pretraining dataset, this time from a May 2017 snapshot of the internet:
It is easy to run our pipeline on any Common Crawl snapshot, and the community has expressed interest in comparing text from years ago with the text in our latest datasets.
Looks like we are still in dire need of Online Language Models after trying out #ChatGPT! Hopefully our project will lead to insights about how we can update even the largest of models, like this one, effectively and efficiently with new information.
Today, anyone can select models, datasets, and metrics on the Hugging Face hub and get the evaluation results automatically! Very important feature for practitioners to choose models, for researchers to test a dataset on lots of models, and for reproducibility efforts!
Excited to share a new tool we’ve built called Evaluation on the Hub 🔥🔥🔥!
With this tool you can evaluate any model on any dataset with any metric🤯
Evaluate your models here👉
Let’s take a look at how it works 🧵
1/
At @MetaAI we favor publication quality over quantity. That's why, among the 100 most cited AI papers in 2022, @MetaAI has authored (or co-authored) 16, ranking 2nd just behind Google with 22. Our research is having a large impact on the community. (And NYU ranks nicely, too.)
@giffmana
@ChengleiSi
The PaLI numbers actually come from this paper which is authored by people from Google. They even finetuned it on additional data, etc. for an extra Winoground edge, I think:
Progress in AI continues to outpace benchmarks.
Check out this new plot, inspired by @DynabenchAI, that shows just how quickly it's happening.
Read more about it here:
New words! It turns out that the world changes a lot in a few months. In our pretraining datasets, we found a reflection of new events that occurred and were amplified over the summer. In the graph below, you can see talk about these terms increasing throughout the summer.
I am bursting with excitement to finally share an idea that has been cooking for a while: Measuring Data.
When you "measure data", you quantify its characteristics to support dataset comparison & curation.
You also begin to know what systems will learn.
We found that all of these models are very poor overall: FLAVA, CLIP, UNITER, ViLLA, VinVL, VisualBERT, ViLT, LXMERT, ViLBERT, UniT, VSE++, and VSRN. Can your model do better?
3/5
Can we know *for sure* that any of these models' generations are truly novel, without manually inspecting every last training image? These models could be even less compositional than we thought.
Models such as Stable Diffusion are trained on copyrighted, trademarked, private, and sensitive images.
Yet, our new paper shows that diffusion models memorize images from their training data and emit them at generation time.
Paper:
👇[1/9]
ChatGPT: Sorry, I can't draw copyrighted characters like Sonic the Hedgehog.
Also ChatGPT: Wow, Sonic the Hedgehog sounds like a fun and original character!
Why is this project important? BERT and GPT2 are still two of the most downloaded models on the Hugging Face Hub, but they have no idea what COVID is or who the current president is. We can take a step towards fixing this by re-training them on new data continuously.
@max_nlp
@GaryMarcus
@raphaelmilliere
Yes, it can be run on Winoground prompts and evaluated with annotators in a way that is similar to what they did with DrawBench.
@Chitwan_Saharia and coauthors can reach out if they need help understanding the setup. Winoground is available for them to use whenever they want.
The dataset was hand-curated by a group of expert annotators and validated by crowdworkers. To assist in analyzing model performance, the annotators tagged examples with a set of 70 fine-grained linguistic tags, 5 coarse linguistic tags, and 3 visual tags.
4/5
@CoexistWithAI is a new AI podcast which tries to make the conversations understandable to the general public. Y'all should follow them! Stay tuned for the episode
@giffmana
@natolambert
Ok I will do this just for you: if an undergrad wants to take this on, I will give them some advice about how to make a cool waifu-related benchmark
There is an episode of Parks & Rec that is actually about AI research. Ron and Chris have a cooking competition. Chris spends all day crafting a fancy custom sandwich. Ron buys hamburger meat from a convenience store. The hamburger wins. Often, in AI, the hamburger wins.
@SashaMTL
@emilymbender
@annargrs
@alkoller
To see that it isn't solved, an example that I like to use is this:
Ask a diffusion model to generate "two forks and three spoons" versus "three forks and two spoons".
The models might get examples like this correct sometimes, but as far as I know, they still aren't reliable.