How can we reduce the computational cost of training neural networks?
Bo Zhao, Hakan Bilen and collaborators have produced a creative body of work developing a technique known as "dataset condensation".
1/7
Just how striking are the recent language model results with Flan-PaLM?
Here's a plot.
Across 57 tasks on mathematics, US history, computer science etc., Flan-PaLM surpasses **both** the June 2023 and June 2024 SotA forecasts from this summer by competitive forecasters.
1/3
Finetuning language models on instructions increasingly seems a compute-efficient way to gain performance.
Recent work from
@hwchung27
,
@_jasonwei
,
@JeffDean
,
@quocleix
& others scales this up to new regimes.
TLDR: Even for big models (540B params), gains are substantial.
1/12
There has been an explosion of NLP research in prompting techniques for communicating tasks to language models.
But writing and sharing good prompts is awkward.
PromptSource is a tool that was developed as part of
@BigscienceW
to tackle this challenge.
🧵1/11
A small personal update:
- Excited to join Google DeepMind 🚀
- Grateful for the wonderful humans I've had the pleasure of working with on my journey so far at
@Cambridge_Eng
and
@Oxford_VGG
❤️
1/ 🚀🔬 Introducing our groundbreaking research paper: "Large Language Models are Few-shot Publication Scoopers"
We've discovered the secret to achieving personal glory and a lifetime supply of Cheerios
Joint work with
@LiliMomeni
and J. F. Henriques
Appears
@sigbovik
today
BLOOM.
A large language model trained by researchers from around the world as part of
@BigscienceW
.
How did they do it?
Why did they do it?
Let's dive in.
1/21
🧵
GPT4Geo
- studies GPT-4's geographic knowledge & reasoning
- suggests GPT-4 can plan complex journeys, describe the global semiconductor supply chain and roughly reconstruct the Hong Kong MTR map
With
@J_Roberts_1
, Timo, Sowmen,
@kaihan_vis
TLDR: Human feedback is key to LLMs, but it is not a panacea
- it under-values some aspects (e.g. factuality)
- is biased (e.g. assertive text is judged more factual)
A nice example of the empirical science of annotation
By
@tomhosking
Blunsom
@max_nlp
LLMs as Tool Makers
- uses LLMs to create their own reusable tools (Python functions) for problem-solving
- allows a lighter model to use tools built by a heavier model relatively cheaply
By
@tianle_cai
, X. Wang,
@tengyuma
,
@xinyun_chen_
,
@denny_zhou
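The maker/user split can be sketched in a few lines (everything below is a stub for illustration, with no real LLM calls; the function names are made up, not from the paper):

```python
# Sketch: an expensive "tool maker" model writes a reusable Python function once,
# and a cheap "tool user" model then only needs to emit calls to it.

def tool_maker(task_description):
    """Stand-in for a strong LLM: returns tool source code for the task."""
    # In the paper, a GPT-4-class model would generate this code.
    return (
        "def schedule_overlap(a, b):\n"
        "    '''Return the overlap of two (start, end) intervals, or None.'''\n"
        "    start, end = max(a[0], b[0]), min(a[1], b[1])\n"
        "    return (start, end) if start < end else None\n"
    )

def load_tool(source, name):
    namespace = {}
    exec(source, namespace)  # register the generated tool
    return namespace[name]

tool = load_tool(tool_maker("find meeting overlaps"), "schedule_overlap")
# The lightweight model now only has to produce calls like:
print(tool((9, 12), (10, 14)))  # → (10, 12)
```

The economics come from amortisation: the heavy model pays the code-generation cost once per task type, not once per instance.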
TLDR: Unsupervised knowledge discovery in LLMs is hard
Intriguing theoretical and empirical results from
@seb_far
et al.
Paper:
And for those who enjoy video summaries:
VisionLLM
- Key idea: treat images as a foreign language for a generalist LLM decoder
- Strong performance on object detection (60 mAP on COCO)
- paper:
by W. Wang,
@PKUCXK
et al.
TLDR: Emergent capabilities appear due to the choice of
- nonlinear, or
- discontinuous
metrics
Work by
@RylanSchaeffer
et al. (Outstanding paper, NeurIPS '23)
Paper:
Also recommended - some nuances by
@boazbaraktcs
:
*TLDR* Major gains in pretraining efficiency/quality by
- filtering data with an LLM judge and
- asking the judge to only keep the "informative" stuff
Work by
@noveens97
et al.
Paper:
Semantic segmentation is valuable, but it remains costly and painful to scale up.
ReCo (NeurIPS 2022) aims to tackle this problem by using:
- the retrieval abilities of CLIP
- the co-segmentation abilities of vision transformers
Here's how it works.
🧵1/9
TLDR: Using an LLM to rephrase text documents to be "in high quality English language as in sentences on Wikipedia" can achieve ~3x faster LLM pretraining
Work by
@pratyushmaini
et al.
Paper:
Today I'll give my final lecture on data structures & algorithms
@Cambridge_Eng
@Cambridge_Uni
😢
But, for those keen to study:
- re-recorded videos
- slides
- and code
are all available online:
(the fun Red-Black Tree vis. is based on work by
@lsbardel
)
GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning
- RL agents have low sample efficiency on open-ended games
- GPT-4 works better by:
(i) reading instructions
(ii) selecting next action
By
@yw_yuewu
,
@shrimai_
@rsalakhu
,
@ybisk
et al.
Using ChatGPT to explore a Computer Vision/ML research project - a mini-collaboration.
Investigator: How can SENet ideas improve ViT?
ChatGPT: Plug the SENet module into the ViT architecture.
OK... reasonable enough.
So down the rabbit hole we go...
1/9
AlignScore
Motivation: checking factual consistency is hard work
Key idea: train general text alignment function, then use as building block to assess factual consistency
By
@yzha_zha
,
@ZhitingHu
et al.
Multitask prompted finetuning (aka instruction finetuning) can boost language model performance.
But how can we make progress beyond English (esp. on languages with limited finetuning data)?
Work by
@Muennighoff
& others in
@BigscienceW
studies this in detail.
1/17 🧵
The False Promise of Imitating Proprietary LLMs
- imitation improves "style, persona & instruction adherence of open-source LMs"
- but "falls short... on more challenging axes such as factuality, coding & problem solving"
Paper:
By
@arnavg_
,
Let’s Verify Step by Step
- finds process-supervision outperforms outcome-supervision on maths problems
- potential example of a "negative alignment tax" (good for alignment + capabilities)
By
@HunterLightman
et al.
Do you like morning jogs?
Do you enjoy speculating about the future of AI?
Are you attending
@ICCVConference
?
If you answered yes to all three, meet at 8 am Wed, Thur, Fri @ OKKO Hotels Porte De Versailles entrance.
All welcome.
1/2
Flan-PaLM was part of a study on scaling up instruction finetuning by
@hwchung27
,
@_jasonwei
& others at
@Google
Gains from:
- bigger models...
- more tasks (but diminishing returns)
- chain-of-thought finetuning
- chain-of-thought prompting with self-consistency
2/3
Are Multimodal LLMs the future for Computer Vision?
Kosmos-2 is a new model from Microsoft Research
It has quite a broad range of tricks up its sleeve (including grounding)
An overview of the work 👇
Links:
Flan-PaLM:
MMLU benchmark of 57 tasks (explains human baselines):
Forecasts: (relevant forecasts updated Aug 15th 2022)
Useful context for forecasts by
@JacobSteinhardt
:
3/3
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
- goes big on imitation learning (includes 1M GPT-4 responses)
- outperforms Vicuna-13B "by more than 100% in complex zero-shot reasoning benchmarks"
By
@subho_mpi
et al.
Does CoT really reveal the reasoning process of an LLM?
Perhaps...
But then again, perhaps not
New work from Anthropic studies this question empirically:
"Measuring Faithfulness in Chain-of-Thought Reasoning" by T. Lanham et al.
An overview👇
What’s a simple way to improve video retrieval? Use more modalities & deal with noise! Excited to announce latest work with Yang Liu,
@NagraniArsha
& Andrew Zisserman. SoTA on five video benchmarks.
Paper:
Code/Models:
#bmvc2019
Struggling to keep up with recent AI developments?
Try **AI News with Samuel Albanie**
A weekly dose of research papers, tools & resources
The
#1
AI news show with Samuel Albanie, as voted by me
🤗 Datasets: A community library for natural language processing (and other fields too)
From
@qlhoest
and a wide range of contributors across
@huggingface
and beyond
TLDR:
- Widely used benchmarks like HumanEval lack test coverage
- EvalPlus synthesises new test-cases to cover gaps
- Consequence: HumanEval ranking changes for some models
Work by
@JiaweiLiu_
et al.
Paper:
This is an amazing piece of work on continual learning from
@xu__ji
and collaborators
@Oxford_VGG
, using a single unified model to synthesize artificial replay samples on the fly during training. The benefits of experience replay, but without a buffer
Do LMs Know When They're Hallucinating References?
- finds many fabrications can be identified using only black-box queries.
- most useful on more powerful models like GPT-4
By A. Agrawal,
@LesterMackey
,
@adamfungi
Here's one reason I think longer context windows (e.g. 10M tokens for Gemini 1.5) are a big deal for software dev:
the whole codebase can be in context
The original HN comment responds to the question "How are some people exceptionally productive?":
PaLM-2 vs other LLMs
- Comparison made in Chatbot arena by
@lmsysorg
- Major gap in Elo Rating (GPT-4 vs PaLM-2)
- Some caveats in the thread below
1/2
TLDR: A new family of lightweight LLMs (2B and 7B params)
- 7B model is trained on 6T tokens using 4096 TPUv5e chips
- weights available for commercial use
Work from Google DeepMind
Paper:
Can you prove which data was used to train an AI?
New techniques from
@damichoi95
,
@yonashav
and
@DavidDuvenaud
suggest the answer may be "yes"
An overview of the work 👇
ToolkenGPT
Key idea: represent tools as tokens for LLMs
Strong performance vs in-context learning on question answering
Paper:
by
@Ber18791531
,
@ZhitingHu
and others
**Seeking feedback**
- I'd like to improve my AI news YouTube videos
- I'd greatly appreciate any constructive criticism
- the feedback is anonymous
Give feedback here:
The news videos can be found here:
We're excited to announce that the Video Pentathlon is now live!
Test out your video retrieval skills on five challenging benchmarks: MSRVTT, MSVD, YouCook2, ActivityNet and DiDeMo. More here:
Baselines and features provided!
#CVPR2020
#video
Related work in this space includes:
- Dataset distillation
@TongzhouWang
et al. (arXiv '18)
- Label distillation
@OBohdal
et al. (NeurIPS workshop '20)
- KIP by
@IAmTimNguyen
et al. (ICLR '21)
4/7
TLDR: Getting LLMs to debate options helps humans choose the right answer
Recent work by
@AkbirKhan
et al.
Paper:
It's interesting to read some of the debates (nicely formatted here: )
Getting ViT in Shape
- Compute-optimal shapes allow for smaller models w. same acc. & same compute
Rules of thumb:
- Scale MLP dim. faster than depth
- Scale depth faster than width
by
@ibomohsin
,
@XiaohuaZhai
,
@__kolesnikov__
,
@giffmana
You are warmly invited to join us at
#ECCV
for our poster at 14:00 today (UK time) or midnight...
“BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues" with
@gulvarol
,
@LiliMomeni
, T. Afouras, J.S Chung, N. Fox & A. Zisserman
Thanks to everyone who attended the CVPR workshop "The End-of-End-to-End: A Video Understanding Pentathlon"!
Links to papers and slides:
Video:
Congratulations to the challenge winners and thank you to all our presenters!
Key idea: compress a large dataset into a small set of synthetic images that can train networks to the same accuracy as the original dataset.
Was a pleasure to examine Bo's thesis on this topic with
@driainmurray
.
2/7
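A minimal sketch of the gradient-matching flavour of this idea, using NumPy and plain linear regression instead of images and deep networks (all names and numbers here are illustrative, not from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 500, 5, 10                      # M synthetic points, M << N real points

# "Real" dataset: a linear regression task
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.1 * rng.normal(size=N)

# Synthetic dataset to be learned
Xs = rng.normal(size=(M, d))
ys = rng.normal(size=M)

probes = rng.normal(size=(8, d))          # model parameters to match gradients at

def grad_mse(A, b, w):
    """Gradient of mean-squared error at parameters w."""
    return A.T @ (A @ w - b) / len(b)

def match_loss(Xs, ys):
    """Mean squared gradient-matching error over the probe parameters."""
    return np.mean([np.sum((grad_mse(Xs, ys, w) - grad_mse(X, y, w)) ** 2)
                    for w in probes])

loss_before = match_loss(Xs, ys)
lr = 0.05
for t in range(3000):
    w = probes[t % len(probes)]
    e = Xs @ w - ys                        # per-point residuals on synthetic data
    r = Xs.T @ e / M - grad_mse(X, y, w)   # gradient-matching residual
    # Analytic gradients of ||r||^2 w.r.t. the synthetic data, then one GD step
    gXs = (2 / M) * (np.outer(e, r) + np.outer(Xs @ r, w))
    gys = -(2 / M) * (Xs @ r)
    Xs -= lr * gXs
    ys -= lr * gys
print(loss_before, match_loss(Xs, ys))    # the matching error drops substantially
```

The real method does this with network gradients and learned synthetic images, but the loop is the same: optimise the data so training on it mimics training on the full set.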
Thanks to everyone who attended Neural Architects and made for such a wonderful workshop! Particularly Barret Zoph, Iasonas Kokkinos, Alan Yuille, Sara Sabour and Ross Girshick for their fantastic talks &
@Momenta_AI
for support. Slides (soon) at
#iccv2019
Beartype has long been one of my favourite open-source libraries
Because:
- it's a great library
- thanks to maintainer Cecil Curry (leycec) every GitHub issue thread is a work of literature
Some classics
PaLI-X: On Scaling up a Multilingual Vision and Language Model
- shows that scaling up both V&L brings gains
- with a massive vision encoder (22B), you can co-train for image classification and OCR
By X. Chen,
@neilhoulsby
,
@RSoricut
& others
- Dataset Distillation with Infinitely Wide Convolutional Networks by
@IAmTimNguyen
et al. (NeurIPS '21)
- Dataset Distillation by Matching Training Trajectories by
@GCazenavette
et al. (CVPR '22)
6/7
TLDR: If we
- train a powerful AI, and
- use current behavioural training approaches
things may go badly
An argument outlined by
@peterbarnettnz
and Jeremy Gillen
Paper:
Initial thoughts: Gemini Ultra is clearly an advance on Gemini Pro.
The Google Docs integration now seems far more useful (fewer hallucinations).
It's also quite fast.
Of course, it still has some way to go with algebra...
I'll be at NeurIPS this week.
DM if you'd like to meet up to discuss any of the following:
- AI-accelerated science
- foundation models
- compute budgets
- mince pies
Flan-PaLM 540B (PaLM 540B finetuned on instructions) makes major progress on MMLU.
Note: my previous graph () lacked some of the available SotA forecasts - that's updated below.
Even with the update, the numbers remain impressive.
3/12
Need help fact-checking ChatGPT?
Filtir is available in the ChatGPT plugin store!
Feedback v. welcome
Note: this is an early version, so you still need to be careful to double-check things yourself
Do you love videos? Do you love natural language?
Why not express those passions through a submission to our workshop on video retrieval from natural language queries!
Find out more about the workshop at
#CVPR2020
@CVPR2020
#video
#retrieval
#workshop
📼
Radix sort.
A glorious sorting algorithm.
Used at least as early as the 1890s by Herman Hollerith and his punched card machines.
Here's a video on how it works.
1/2
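For those who prefer code to video, a minimal LSD radix sort sketch in Python (non-negative integers only):

```python
def radix_sort(nums, base=10):
    """Least-significant-digit radix sort for non-negative integers."""
    if not nums:
        return []
    nums = list(nums)
    max_val = max(nums)
    exp = 1
    while max_val // exp > 0:
        # Bucket by the current digit; stable, so earlier passes are preserved
        buckets = [[] for _ in range(base)]
        for n in nums:
            buckets[(n // exp) % base].append(n)
        nums = [n for bucket in buckets for n in bucket]
        exp *= base
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# → [2, 24, 45, 66, 75, 90, 170, 802]
```

Much like Hollerith's machines: one stable pass per digit column, least significant first.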
Much of this work is behind the scenes.
It does not receive the glory of creative code releases, popular preprints and dramatic demos.
And so, my dear Twitterverse, I am letting you know.
He is a wonderful colleague.
And a great educator.
3/3
@immazzystar
describes the high leverage that GPT-4 gives individuals:
- "The overnight surge in productivity is intoxicating"
@robkhenderson
explores implications of LLMs:
- "people will rely on them to learn what is permissible to say in polite society"
25/25
Statement: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
Signed by:
- Turing Award winners
- AI researchers
- Hassabis, Altman, Amodei
- many more
@ai_risks
Multiagent debate
- use multiple LM instances to propose & debate over multiple rounds
- improves reasoning & factual accuracy
- complementary to chain-of-thought etc.
Paper:
By
@du_yilun
,
@ShuangL13799063
,
@IMordatch
et al.
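The orchestration is simple enough to sketch (toy stand-in agents below; a real system would make LLM API calls where the stubs sit):

```python
from collections import Counter

def debate(agents, question, rounds=2):
    """Multiagent debate loop: each agent answers, then revises its answer
    after seeing the other agents' latest answers."""
    answers = [agent(question, []) for agent in agents]   # initial proposals
    for _ in range(rounds):
        answers = [
            agent(question, [a for j, a in enumerate(answers) if j != i])
            for i, agent in enumerate(agents)
        ]
    return answers

# Toy stand-ins: each starts from its own guess and adopts the majority view.
def make_agent(initial_guess):
    def agent(question, others):
        if not others:
            return initial_guess
        return Counter(others + [initial_guess]).most_common(1)[0][0]
    return agent

agents = [make_agent("12"), make_agent("12"), make_agent("15")]
print(debate(agents, "What is 3 * 4?"))  # → ['12', '12', '12']
```

The paper's finding is that real LLM agents behave a bit like this too: exposure to other answers pulls outlier reasoning back towards the (usually more accurate) consensus.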
- Dataset Condensation with Contrastive Signals by S. Lee et al. (ICML '22)
- Dataset Condensation via Efficient Synthetic-Data Parameterization by J-H Kim et al. (ICML '22)
7/7
TLDR: Self-Discover prompting
- works out a reasoning strategy for a given task
- amortises the cost of that work across task instances
- brings gains over Chain-of-Thought
Work by
@peizNLP
et al.
Paper: