Here is our “slick” alternative to RLHF, without the RL: SLiC-HF
TL;DR: Works as well as RLHF, but is a lot simpler.
About as easy and efficient as fine-tuning, and much better than simply fine-tuning on the good examples.
From great collaborators: @yaozhaoai, …
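For intuition, here's a minimal sketch of the SLiC-HF calibration objective (variable names, hyperparameter values, and the loss wrapper are mine, not the paper's code):

```python
import torch

# Sketch of SLiC-HF's sequence-likelihood calibration loss.
# logp_pos / logp_neg: the policy's summed token log-probs of the
# human-preferred and dispreferred responses; logp_sft: log-prob of the
# reference (SFT) target, used as a standard fine-tuning regularizer.
def slic_hf_loss(logp_pos, logp_neg, logp_sft, delta=1.0, lam=0.5):
    calibration = torch.clamp(delta - logp_pos + logp_neg, min=0).mean()
    regularizer = -logp_sft.mean()  # ordinary cross-entropy term
    return calibration + lam * regularizer
```

No reward model, no rollouts, no RL loop: just a ranking hinge on sequence likelihoods, which is why it trains about as cheaply as fine-tuning.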
The GPT-4 tokenizer is open source.
If you look at the code, an interesting finding is the presence of special tokens FIM_*. This is probably for fill-in-the-middle pretraining.
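You can poke at this yourself with OpenAI's tiktoken library; the prefix-suffix-middle (PSM) ordering below follows the FIM paper's convention, and the training example is hypothetical:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 tokenizer
print(enc.special_tokens_set)
# includes '<|fim_prefix|>', '<|fim_middle|>', '<|fim_suffix|>', ...

# Hypothetical FIM training example in prefix-suffix-middle (PSM) order:
prefix, middle, suffix = "def add(a, b):\n    ", "return a + b", "\n"
fim_text = ("<|fim_prefix|>" + prefix +
            "<|fim_suffix|>" + suffix +
            "<|fim_middle|>" + middle)
tokens = enc.encode(fim_text, allowed_special="all")
```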
The greatest, most productive living mathematician is using LLMs to improve his work productivity ... in math. 🤯
"I could feed GPT-4 the first few PDF pages of a recent math preprint and get it to generate a half-dozen intelligent questions that an expert attending a talk on…
Terence Tao, the famous mathematician, on using LLMs to aid in mathematical research:
"2023-level AI can already generate suggestive hints and promising leads to a working mathematician and participate actively in the decision-making process. When integrated with tools such as…
We are hiring for a full-time researcher/engineer in the Brain (Google Research) team who will focus on text generation research and its applications. A wide variety of backgrounds and experiences will be considered. DM if you're interested or have leads.
My team has open-sourced a pure Python implementation of ROUGE (Apache 2 license) that can be used as a replacement for the original Perl version (which also had an ambiguous license). @harvardnlp @stanfordnlp
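For reference, usage of the rouge-score package (assuming that's the released name) looks roughly like this:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The cat sat on the mat.",
    prediction="A cat was sitting on the mat.",
)
print(scores["rougeL"].fmeasure)  # each entry has precision/recall/fmeasure
```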
Interesting paper recently posted to arXiv: "Arrows of Time for Large Language Models"
TL;DR: it is easier for larger models to predict in the forward direction (next token) than backward (previous token). The larger the model, the more pronounced the…
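Worth spelling out why this is surprising: by the chain rule the two factorizations assign exactly the same probability, P(w_1, ..., w_m) = \prod_{i=1}^m P(w_i | w_{<i}) = \prod_{i=1}^m P(w_i | w_{>i}), so any gap in achievable loss is a property of the learned model and the data, not of the math.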
People are not well-calibrated on AI progress in mathematical reasoning.
GSM8K () is a common task testing basic grade-school math ability, but was only introduced in Oct 2021. Manifold Markets only thought there was a ~50% chance that a system would get…
Sounds like OpenAI got some good numbers on GSM8K, possibly MATH.
Speculating, but there is a 'star' in STaR, a technique that fine-tunes a model on its own (better) outputs, which some people see as 'self-improvement'.
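A rough sketch of one STaR iteration (`generate` and `finetune` are hypothetical helpers; it assumes tasks with checkable answers, e.g. GSM8K):

```python
# Hypothetical helpers: generate() samples a rationale + answer from the
# model; finetune() runs ordinary supervised fine-tuning on the kept traces.
def star_iteration(model, problems):
    keep = []
    for question, gold_answer in problems:
        rationale, answer = generate(model, question)  # sample reasoning trace
        if answer == gold_answer:                      # keep only correct ones
            keep.append((question, rationale, answer))
    return finetune(model, keep)  # train the model on its own good outputs
```

(The actual paper also adds a 'rationalization' step, retrying failed problems with the answer given as a hint.)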
As generative language models hit production, there’s increased risk from bad outputs. It’s useful to know when to *not* show the outputs to the user, or defer to better, larger models (at the cost of compute). A 🧵on an ICLR 2023 paper from Google. (1/n)
People are realizing RLHF can be easy with DPO and SLiC-HF. If you were wondering how they compare, the answer is that they are pretty similar, and our paper (led by @Terenceliu4444) shows the math.
The biggest question is whether you should train a preference…
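For reference, a sketch of the DPO loss (Rafailov et al.): where SLiC-HF puts a hinge on the log-likelihood margin, DPO puts a logistic loss on a reference-normalized version of the same margin (variable names mine; inputs are tensors of summed sequence log-probs):

```python
import torch.nn.functional as F

# logp_* : policy log-probs of preferred/dispreferred responses
# ref_*  : the same quantities under a frozen reference model
def dpo_loss(logp_pos, logp_neg, ref_pos, ref_neg, beta=0.1):
    margin = beta * ((logp_pos - ref_pos) - (logp_neg - ref_neg))
    return -F.logsigmoid(margin).mean()
```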
Aligning LLMs with Human Preferences is one of the most active research areas🧪
RLHF, DPO, and SLiC are all techniques for aligning LLMs, but they come with challenges. 🥷
@GoogleDeepMind proposes a new method, “Statistical Rejection Sampling Optimization (RSO)” 🧶
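The core trick, roughly (my sketch, not the paper's exact algorithm): sample many candidates from the SFT policy and use rejection sampling so the kept ones are distributed approximately like the reward-tilted optimal policy pi*(y|x) ∝ pi_sft(y|x) · exp(r(x, y) / beta):

```python
import math
import random

# Accept candidate y with probability exp((r(y) - r_max) / beta), the
# standard rejection-sampling acceptance rule for the tilted target above.
def statistical_rejection_sample(candidates, rewards, beta=0.5):
    r_max = max(rewards)
    return [y for y, r in zip(candidates, rewards)
            if random.random() < math.exp((r - r_max) / beta)]
```

The accepted samples then feed a DPO/SLiC-style preference loss.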
@karpathy is perhaps the most talented deep learning teacher out there, and his video lectures are always worth watching.
Some minor addenda on the history of tokenization:
While GPT-2 used sub-word tokenization pretty early, it was really shown to be important for handling…
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"
Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and…
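The heart of BPE training fits in a few lines; a toy sketch over raw bytes (function and variable names mine):

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace each adjacent occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))   # start from raw bytes (0..255)
for new_id in range(256, 259):              # e.g. three merges
    pair = Counter(zip(ids, ids[1:])).most_common(1)[0][0]
    ids = merge(ids, pair, new_id)          # mint one new token per merge
```

encode() then applies the learned merges in order; decode() maps token ids back to bytes.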
Had a look at RWKV. It's more like an Attention-Free Transformer (AFT) that can be viewed as an RNN for fast inference. The training code is written like a Transformer.
"Time-mixing" ~ AFT ~ linear attention replacement
"channel-mixing" ~ FFN - not sure this change is needed
GSM8K/MATH are great testbeds for self-improvement because model outputs can be evaluated for correctness more or less automatically (like Go). Thus there is a high-fidelity feedback signal that can improve models without humans.
For more open-ended generation, humans often…
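Concretely, grading GSM8K can be as simple as string-matching the final answer, since gold answers end in a '#### <number>' marker (a minimal sketch; robust graders also normalize fractions, units, etc.):

```python
import re

def final_answer(text):
    # GSM8K gold answers end in '#### <number>'; prompt the model to emit
    # the same marker and grading reduces to a string comparison.
    m = re.search(r"####\s*(-?[\d.,]+)", text)
    return m.group(1).replace(",", "") if m else None

def is_correct(model_output, gold):
    return final_answer(model_output) == final_answer(gold)
```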
All companies will train their own ChatGPT/GPT-4 thanks to open source!
So cool to see this paper from Bloomberg, which is one of @huggingface’s favorite customers :)
Absolutely agree. Many researchers assume that a dataset is good because a lot of people use it, without really knowing the details about its provenance / quality.
One reason academic data is often of poor quality is that high-quality data is expensive to procure, and so data…
In fact, one of my big takeaways from the Ouyang et al. ’22 (InstructGPT) paper was that optimizing for public NLP dataset collections is counterproductive in deployment settings (as measured via human preferences).
It's quite possible that fine-tuning LLaMA () with this instruction-tuning dataset will get you very close to text-davinci-001 (InstructGPT) performance.
Open-source LLMs are going to improve rapidly!
For everyone building ChatGPT at home, there's now a very cool dataset on the Hub that allows you to train instruction models at comparable quality to OpenAI's InstructGPT 🤯
How long before someone trains a certain 🌸 or 🦙 on it?
Download it here 👉:
When you feel the AGI it’s mostly the G, for General. Old AI can easily beat LLMs at chess. The new AIs spend most of their existence/compute just observing the world, without being taught explicit skills, but when you ask them random questions it’s clear they’ve learned a lot of…
@Singularitarian Language models are trained by taking a bunch of text, converting it into sequences of tokens, and learning to predict the next token from the previous ones. This works because P(w_1, w_2, ..., w_m) = \prod_{i=1}^m P(w_i | w_{<i}) (chain rule).
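In code, the training objective is just a shifted cross-entropy; a self-contained sketch with random logits standing in for a model:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50257, 8
tokens = torch.randint(vocab_size, (1, seq_len))  # [batch, seq]
logits = torch.randn(1, seq_len, vocab_size)      # model outputs (random here)

# Position i predicts token i+1, so shift logits and targets by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 1..m-1
    tokens[:, 1:].reshape(-1),               # targets w_2 .. w_m
)
# Minimizing this minimizes sum_i -log P(w_i | w_{<i}), the chain-rule NLL.
```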
New helpful AI-powered features coming to smart canvas in @GoogleWorkspace: automatically generated summaries, email draft + meeting notes templates in Docs, formula corrections in Sheets, and more.
🚨Stop using positional encoding (PE) in Transformer decoders (e.g. GPTs). Our work shows 𝗡𝗼𝗣𝗘 (no positional encoding) outperforms all variants like absolute, relative, ALiBi, Rotary. A decoder can learn PE in its representation (see proof). Time for 𝗡𝗼𝗣𝗘 𝗟𝗟𝗠𝘀🧵[1/n]
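The claim is easy to state in code: drop the position table entirely and let the causal mask carry the order information (a minimal PyTorch sketch of the idea, not the paper's code):

```python
import torch
import torch.nn as nn

class NoPEDecoderStem(nn.Module):
    """Token embeddings feed straight into causally-masked attention;
    no absolute/relative/ALiBi/rotary position signal is added anywhere."""
    def __init__(self, vocab_size, d_model=256, n_head=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # no position table
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)

    def forward(self, ids):
        x = self.embed(ids)  # [B, T, D]
        T = ids.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=causal)  # mask encodes order
        return out
```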
The juxtaposition of
(a) downturn in general tech
vs
(b) boom in AI
is quite jarring.
"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness ..."
-- A Tale of Two Cities
Well that was faster than I expected.
"We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to…
Had mixed feelings about the term "Foundation Models", but have to admit that "FoMo" (h/t @charles_rqi) is the perfect abbreviation, also capturing the zeitgeist of the ML research community.
Language models (LMs) exhibit harmful biases that can get worse with size. Reinforcement learning from human feedback (RLHF) helps, but not always enough. We show that simple prompting approaches can help LMs trained with RLHF produce less harmful outputs.
I wouldn't be surprised if pretraining with a focus on code confers benefits beyond using mainly natural language. Next token prediction for language is usually very local, whereas code often requires longer dependencies to do things like close brackets or refer to distant defs.
How did the initial #GPT3 evolve into today's #ChatGPT? Where do the amazing abilities of #GPT3.5 come from? What is enabled by #RLHF? In this article with @allen_ai, we trace the emergent abilities of #LLM to their sources from first principles.
I tend to think collecting human feedback is something the open community could excel at relative to big tech players. In particular you don't need a lot of concentrated compute, which is where the open community is most disadvantaged.
I believe that in 6-12 months we'll have an open source GPT-4 replication.
But GPT-5 will be built on immense amounts of human feedback collected as shown here, and I'm not sure how the open community will replicate that.
The pricing of the ChatGPT API makes ChatGPT Plus look expensive at $20/month for most users.
Arbitrage opportunity: build a web-app using the API and charge less than Plus.
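The arithmetic behind that, assuming gpt-3.5-turbo's launch price of $0.002 per 1K tokens:

```python
price_per_1k_tokens = 0.002   # USD, gpt-3.5-turbo at launch (March 2023)
plus_subscription = 20.0      # USD per month for ChatGPT Plus

tokens_for_plus_price = plus_subscription / price_per_1k_tokens * 1000
print(f"{tokens_for_plus_price:,.0f} tokens/month")  # 10,000,000
# ~10M tokens/month -- far more than a typical chat user consumes.
```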
New SOTA results for abstractive summarization just posted to ! We have a new way to pre-train for summarization, and evaluated our PEGASUS model on 12 diverse downstream summarization tasks, achieving SOTA on all, in some cases by a significant margin.
@AlphaSignalAI @gdb Misplaced commas can often be found via unit tests or static checks.
With ML code it's more subtle. If you initialize a param with the wrong distribution, or if your tokenizer doesn't break strings up in the "right" way, you could get much worse results. The devil is really in…
Passing "Needle in a haystack" is not sufficient to say you solved long-context.
Possibly better test: checking the gap in performance between (a) fine-tuning and (b) putting the same number of examples in-context across a variety of datasets/tasks of varying complexity.
Great work by Trieu, and a really nice talk that I had the privilege to see a while ago internally at GDM.
What is interesting is this doesn't even use LLMs. The model is tiny (by today's standards), like small GPT-2. And it is solving problems that GPT-4 cannot. I imagine using…
$235m has been invested into Vector Databases in the past year:
- @qdrant_engine - $7.5m Seed
- @tryChroma - $18m Seed
- @weaviate_io - $50m Series A
- @milvusio - $60m Series B
- @Pinecone - $100m Series B
For reference, MongoDB raised $300m from start to $1.2b IPO.
I keep revisiting this great paper from @andy_l_jones: “Scaling scaling laws with board games”. It shows how the training compute and inference compute of MCTS can be traded off against each other: 10x more MCTS steps is almost the same as training 10x more.
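Schematically (my notation; coefficients are fit per game), the paper's frontier is roughly linear in the logs of both budgets: Elo ≈ α·log10(C_train) + β·log10(C_test) + γ, with α ≈ β near the frontier, so a 10x increase in either budget buys about the same Elo.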
Most parents who enroll their kids in chess class don't actually care about chess performance.
"The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher…
While vendor lock-in is an understandable concern with TPUs, if you use Jax it is quite easy to switch between TPU and GPU, e.g. for training language models.
This wasn't always the case, but the excellent Jax team has achieved this with a lot of good work over the last year.
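The same jitted code runs on either backend unchanged; switching hardware is mostly a provisioning question. A trivial sketch:

```python
import jax
import jax.numpy as jnp

@jax.jit  # compiled via XLA for whichever backend is present
def mse(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

w, x, y = jnp.zeros((4,)), jnp.ones((8, 4)), jnp.zeros((8,))
print(mse(w, x, y))   # runs on TPU, GPU, or CPU with the same code
print(jax.devices())  # e.g. TPU cores on a TPU VM, or CudaDevice(id=0)
```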
Incidentally, Google DeepMind recently published a paper in Nature making progress on his favourite open question, the problem of the maximal size of a cap set:
A relevant post from Terry's blog:
Introducing FunSearch in @Nature: a method using large language models to search for new solutions in mathematics & computer science. 🔍
It pairs the creativity of an LLM with an automated evaluator to guard against hallucinations and incorrect ideas. 🧵
While "AI engineers" don't usually publish papers I still think you should cite them somehow if your method is significantly influenced by their work, e.g. open-source code.
Optimus can now sort objects autonomously 🤖
Its neural network is trained fully end-to-end: video in, controls out.
Come join to help develop Optimus (& improve its yoga routine 🧘)
→
Open models climbing AlpacaEval () are probably exploiting the length bias of the auto-annotator.
There is always a challenge in optimizing reward that you're hacking the reward function and not what you want. If length-adjusted, some of these models are not…
Hmm, this seems to be the ChatGPT-4 preamble (system) prompt:
"""
You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture.
Knowledge cutoff: 2023-04
Current date: 2024-01-11
Image input capabilities: Enabled
Tools
python
When you send a message…
Apparently some people prefer Waymo to human drivers who can be more unpredictable, and are willing to pay *more* than Uber.
Super-human AI can improve gross margins on both cost and price!
In advance of ICLR 2018 we've open-sourced the code for the tasks described in our paper "Generating Wikipedia by Summarizing Long Sequences" ( ). Go try it out:
Here's an interesting thought experiment to gain intuition on why it is often easier to predict 'forward' given knowledge of causality:
1. Forward: an elaborate ice sculpture (say a fancy castle) is left out on a hot day and melts. It is easy to predict that it'll end up as a…
Introducing AlphaGeometry: an AI system that solves Olympiad geometry problems at a level approaching a human gold-medalist. 📐
It was trained solely on synthetic data and marks a breakthrough for AI in mathematical reasoning. 🧵
Glad to see open-community take pre-training data seriously.
Another thing to beware of is de-duplication.
1. within training: to ensure you repeat data only intentionally
2. between training and eval: to ensure your eval is really held-out and you're measuring progress…
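A toy sketch of the exact-match case (real pipelines add near-duplicate detection, e.g. MinHash, and apply the same check between train and eval):

```python
import hashlib

def dedup(docs):
    """Drop exact duplicates after whitespace/case normalization."""
    seen, out = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(doc)
    return out
```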
This take on the FineWeb release is some of the most interesting feedback, and also a reason FineWeb is very different from even larger datasets like RedPajama-V2 (which is double its size!)
Surprisingly, the raw size of the 15T-token dataset is not very important; what is much…
Very cool work led by the talented @mitchnw. If you don't have access to huge amounts of compute but still want to contribute to language model research, read it! And stop sulking about the end of research.
🔥'Compression disproportionately impacts model performance on the underrepresented long-tail of the data distribution. Perhaps an explanation of the "bigger is better" race.' 🔥
This is fantastic. Full implementation of pruning identified exemplars and great walkthrough of how to audit the impact of compression techniques like pruning. 🎉🔥
We call this “Selective Generation” and propose a simple/cheap/effective way to do it.
We focus on cases where there is an input/output text, i.e. text2text, although it’s quite general, e.g. prompting (input) a language model for a response (output) is a special case. (2/n)
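In pseudocode terms, the mechanism is a thresholded abstention (scorer and fallback are hypothetical stand-ins; the paper is about how to get a good, cheap quality score):

```python
def selective_generate(model, quality_scorer, prompt, threshold=0.8):
    output = model(prompt)
    # Show the output only if its estimated quality clears the threshold;
    # otherwise abstain or defer to a larger model (at extra compute cost).
    if quality_scorer(prompt, output) >= threshold:
        return output
    return defer_to_fallback(prompt)  # hypothetical fallback path
```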
A few updates on the PEGASUS summarization work:
- Human raters don't prefer human-written summaries over the model's.
- We released code and checkpoints on GitHub.
- Work to appear at ICML2020.
Presenting PEGASUS, an approach to pre-training that uses gap-sentence generation to improve the performance of fine-tuning for #NaturalLanguageUnderstanding tasks, like abstractive summarization. Read more and try the code for yourself ↓
All coding projects have two parts:
1. The fun part: where you get to "create"
2. The pain part: where you have to debug
Code LLMs are "automating" the fun parts while introducing bugs and not helping much with debugging. As a developer, you’re left with more pain to deal with.
A short note on how the way instruction-tuning is often done in open-source can actually encourage hallucination.
TL;DR: Some instruction-tuning needs to be model-specific, which is why you have to get your model in front of users.
I deleted this tweet because the “AI powered drone turns on its operator story” was total nonsense—the Colonel who described it as a simulation now says it was just “a thought experiment.”
😑