Cameron R. Wolfe, Ph.D.

@cwolferesearch

22,518
Followers
633
Following
709
Media
3,501
Statuses

ML @Netflix • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable

Joined August 2021
Pinned Tweet
@cwolferesearch
Cameron R. Wolfe, Ph.D.
7 months
Q-Learning is *probably* not the secret to unlocking AGI. But, combining synthetic data generation (RLAIF, self-instruct, etc.) and data efficient reinforcement learning algorithms is likely the key to advancing the current paradigm of AI research… TL;DR: Finetuning with
47
453
2K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
Tweet media one
14
27
2K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Large language models (LLMs) are fun to use, but understanding the fundamentals of how they work is also incredibly important. One major idea and building block of LLMs is their underlying architecture: the decoder-only transformer model. 🧵[1/6]
42
386
2K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
7 months
Due to the recent surge in popularity of AI and language models, one of the most common questions I hear is: How can we train a specialized LLM over our own data? The answer is actually pretty simple… TL;DR: Training LLMs end-to-end is quite difficult due to the size of the
23
327
2K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
One of the best ways to reduce hallucinations with LLMs is by retrieving useful, factual information and injecting it into the LLM’s prompt as added context. Although this might sound complicated, it’s actually quite easy to implement with standard vector search functionality…
41
199
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
The ChatGPT API was released yesterday and it costs 90% less than expected. Here are five methods (and resources to learn about them) that are **probably** being used to enable this price reduction… 🧵[1/6]
27
267
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
4 months
RAG is one of the best (and easiest) ways to specialize an LLM over your own data, but successfully applying RAG in practice involves more than just stitching together pretrained models… What is RAG? At the highest level, RAG is a combination of a pretrained LLM with an
19
267
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
5 months
The volume of LLM research being released is staggering. Although there are too many new papers for any one person to read, this work can be largely distilled into a much smaller set of overlapping themes. Recently, there are three trends in LLM research that have been especially
30
277
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Self-attention is the primary building block of large language models (LLMs) and transformers in general. But, how exactly does it work? 🧵 [1/8]
20
198
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Although large language models (LLMs) are incredibly capable, they are pretty simple to understand. In fact, the core components of most LLMs can be distilled into five major components… 🧵[1/7]
27
209
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
5 months
Generative large language models (LLMs) are based upon the decoder-only transformer architecture. Currently, these types of generative LLMs are incredibly popular. However, I use encoder-only architectures for 90% of use cases as a practitioner. Here’s why… History of
27
183
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Nearly all recently-proposed large language models (LLMs) are based upon the decoder-only transformer architecture. But, is this always the best architecture to use? It depends… 🧵 [1/8]
24
200
1K
@cwolferesearch
Cameron R. Wolfe, Ph.D.
6 months
Want to train a specialized LLM on your own data? The easiest way to do this is with low rank adaptation (LoRA), but many variants of LoRA exist. Here’s an overview of all (or at least most) of the techniques that are out there… LoRA models the update derived for a model’s
16
214
968
@cwolferesearch
Cameron R. Wolfe, Ph.D.
2 months
LLaMA-3 is a prime example of why training a good LLM is almost entirely about data quality… TL;DR. Meta released LLaMA-3-8B/70B today and 95% of the technical info we have so far is related to data quality: - 15T tokens of pretraining data - More code during pretraining
21
225
918
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Each “block” of a large language model (LLM) is comprised of self-attention and a feed-forward transformation. However, the exact self-attention variant used by LLMs is masked, multi-headed self-attention. Let’s break down what this means…🧵[1/11]
9
158
889
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
After GPT-3 was proposed, a lot of research was done to find an even better language model. Initial attempts focused on just training larger models. Contrary to popular belief, however, there is more to creating a good language model than size… 🧵[1/8]
18
136
877
@cwolferesearch
Cameron R. Wolfe, Ph.D.
4 months
What’s the easiest way to specialize an LLM over your own data? Recent research has studied this problem in depth, and RAG is way more effective (and easier to implement) compared to extended pretraining or finetuning… Knowledge from pretraining. A lot of factual information is
16
157
883
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
Have you ever wondered why all language models use decoder-only architectures? It's partially because decoder-only models work great for next-token prediction. However, recent research has also analyzed the choice of architecture for language models in depth... Decoder-only
9
106
805
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
I just finished writing a survey on the history of open-source LLM research, spanning from the early days (e.g., OPT and BLOOM) to recent models like LLaMA-2. Here are three takeaways that seem to have the biggest impact on LLM quality… Base models make all the difference.
19
146
795
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Large Language Models (LLMs) are notoriously bad at solving reasoning-based tasks. However, we can drastically improve their reasoning performance using simple techniques that require no fine-tuning or task-specific verifiers. Here’s how…🧵[1/7]
18
127
724
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 month
Prompt engineering is one of the most rapidly-evolving research topics in AI, but we can (roughly) group recent research on this topic into four categories… (1) Reasoning: Simple prompting techniques are effective for many problems, but more sophisticated strategies are
12
170
724
@cwolferesearch
Cameron R. Wolfe, Ph.D.
3 months
New language models get released every day (Gemini-1.5, Gemma, Claude 3, potentially GPT-5 etc. etc.), but one component of LLMs has remained constant over the last few years—the decoder-only transformer architecture. This architecture has five components… Why should we care?
12
158
705
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Foundation models for language understanding (such as GPT-4) are becoming increasingly common and useful. But, what about other modalities? Today, Meta AI released the "Segment Anything" model, a foundation model for image segmentation... 🧵 [1/6]
7
120
691
@cwolferesearch
Cameron R. Wolfe, Ph.D.
5 months
Given the popularity of retrieval augmented generation (RAG) for LLMs, one question I’m constantly asked is: What model should I use to embed my data for RAG? This question has a simple answer that I use for (almost) all applications… TL;DR: Sentence BERT (sBERT) is an
24
116
685
@cwolferesearch
Cameron R. Wolfe, Ph.D.
5 months
Trying to create a language model that understands your own custom data? Here are techniques you can use to create a “specialized” LLM, ordered in terms of the amount of complexity/compute involved… TL;DR: When trying to solve problems with language models, we should start
6
140
677
@cwolferesearch
Cameron R. Wolfe, Ph.D.
11 months
Given the current foundation model paradigm, I wonder if building/training models will become antiquated. Will future data scientists understand the details of optimization, architectures, etc.? ML may slowly be abstracted in favor of simpler (language model-based) solutions...
22
95
648
@cwolferesearch
Cameron R. Wolfe, Ph.D.
7 months
I just wrote a long-form overview of RLHF, its origins/motivation, and the impact it has had on the generative AI movement. My conclusion? RLHF is (arguably) the key advancement that made modern generative LLMs possible. Here’s why… TL;DR: Prior to RLHF, we primarily relied upon
13
142
646
@cwolferesearch
Cameron R. Wolfe, Ph.D.
8 months
The creators of FlashAttention (makes language model training much faster) just released another awesome efficiency tool—FlashDecoding—that can make LLM inference up to 8X faster on long sequences. Here’s how it works… Background reading. To understand most of this post, you
8
127
639
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Recently, I’ve read and overviewed publications for nearly 20 different large language models (LLMs) from GPT to ChatGPT. Here’s what I learned… 🧵 [1/10]
21
96
631
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Traditionally, LLMs have struggled to solve complex problems that require reasoning. Chain of thought prompting has improved their abilities in this domain, but why stop there? Here are four prompting techniques for solving difficult, multi-step problems with LLMs… 🧵 [1/8]
14
132
613
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
Research on advanced prompting techniques for language models has extended chain of thought and tree of thought prompting to graph-structured reasoning processes. But, did you know that there are two versions of “graph of thought” prompting that have been proposed already? Some
14
97
613
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
Using a KV cache is one of the most commonly-used tricks for speeding up inference with LLMs. Here’s exactly how it works… Autoregressive decoding process. When we perform inference with an LLM, it follows an autoregressive decoding process. Put simply, this means that we i)
7
95
586
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Large Language Models (LLMs) commonly use a “greedy decoding” strategy to generate their output, but what exactly does this mean? Here’s how this process works… 🧵 [1/10]
12
118
571
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
When we interact with language model APIs, such as the OpenAI API, we typically have to set a “temperature” parameter when obtaining output from the language model. But, what exactly is this parameter and how does it work? Let’s take a deeper look… The decoding process:
14
106
568
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Prompt engineering for language models usually involves tweaking the wording or structure of a prompt. But, recent research has explored automated prompt engineering via continuous updates (e.g., via SGD) to a prompt’s embedding. Here’s how these techniques work… 🧵 [1/8]
13
106
560
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Given the incredible performance of large language models (LLMs) like ChatGPT, it’s hard to believe that the original generative pre-trained transformer (GPT) was proposed less than five years ago. Here’s how we got to where we are right now… 🧵[1/8]
9
96
550
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Many different (text-based) transformer architectures exist, but when and where should we use them? Here’s a quick list of four important transformer variants and the best applications to use them for…🧵[1/7]
9
115
558
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Large language models (LLMs) have been criticized due to their heavy reliance on humans to create datasets for fine-tuning and RLHF, but recent research suggests that we might not even need humans for this… 🧵[1/9]
9
71
556
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Reinforcement learning from human feedback (RLHF) can teach LLMs a variety of interesting skills. As an example, Sparrow, a chatbot developed by @DeepMind , is taught (via RLHF) to support its factual claims by finding relevant information on Google... 🧵 [1/7]
3
78
555
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Vision Transformers (ViTs) are a powerful deep learning architecture, but what’s the difference between ViT and a text-based transformer like BERT? Despite being applied in completely different domains, these models have only one major difference… 🧵[1/7]
4
101
546
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
Advanced prompting techniques allow language models to solve complex problems but are often constrained to a single line of reasoning. Tree of thoughts (ToT) prompting avoids this by deliberately decomposing, planning, and exploring candidate solutions to a problem via a
14
95
551
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Foundation models are a popular topic in AI research. However, task-specific fine-tuning outperforms zero/few-shot learning with foundation models in most cases. Specialized models are hard to beat! Luckily, recent research indicates that we can combine the strengths of both
11
114
544
@cwolferesearch
Cameron R. Wolfe, Ph.D.
7 months
Reinforcement learning from human feedback (RLHF) is a major catalyst of the recent generative AI boom, as it enables language models to surpass human writing quality. RLHF makes this possible by improving the alignment process in three main ways... What is RLHF? RLHF is a
13
102
541
@cwolferesearch
Cameron R. Wolfe, Ph.D.
5 months
The impressive in-context learning abilities of LLMs have created the need for larger context windows. Recently, researchers discovered that we can easily extend the context window of a pretrained LLM with one simple trick (and no extra training)… What is the context window?
9
106
542
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Most high-performing large language models (LLMs) are closed-source and can only be accessed via paid APIs. However, the public release of LLaMA has recently challenged this trend. Here’s what you need to know about LLaMA… 🧵[1/7]
9
100
539
@cwolferesearch
Cameron R. Wolfe, Ph.D.
3 months
Retrieval-augmented generation (RAG) is the best way to specialize an LLM over your own data. Researchers have recently discovered a finetuning approach that makes LLMs much better at RAG... RAFT and specializing LLMs. Most use cases with LLMs require specializing the model to
8
121
536
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
This is huge news! The number of times I've been asked "How difficult would it be to create a ChatGPT for <insert domain>?" is nearly countless. I'm sure versions of ChatGPT for retail, banking, insurance, and more will soon be available. [1/3]
8
87
530
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Object detection is a fundamental problem in computer vision. Although Vision Transformers (ViTs) achieve state-of-the-art performance today, the history of object detection proceeded in three distinct generations of innovation… 🧵 [1/7]
11
100
531
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
In the wake of LLaMA, the deep learning research community quickly adopted the view that open-source LLMs will rule the future—reproducing open-source variants of proprietary models seemed to be easy and cheap. Is this the truth? Here’s a brief timeline of model proposals and
17
101
526
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Can large language models (LLMs) train themselves? The explosion of imitation-based open-source LLMs drew criticism due to cursory evaluation that covered up performance gaps. However, recent research shows powerful open-source LLMs can actually be created by imitating other
11
101
525
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Reinforcement Learning from Human Feedback (RLHF) is a valuable fine-tuning technique, but people often misunderstand how it works and the impact that it has on LLM behavior. Meta's LIMA publication provides a lot of information that puts the value of RLHF into perspective...
18
107
510
@cwolferesearch
Cameron R. Wolfe, Ph.D.
3 months
Now that Grok-1 has been released, it’s the perfect time to brush up on how Mixture-of-Experts (MoE) layers work in LLMs. Here’s a quick explainer… TL;DR: When applied to transformer models, MoE layers have two primary components: - Sparse MoE Layer: replaces dense
9
110
503
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Is Attention really all we need? The answer seems to be yes, but why is this the case? Here are the two main problems that transformers solved, which enabled many of the breakthroughs in natural language processing that we see today… 🧵[1/6]
8
55
498
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
Powerful LLMs like GPT-4 can follow complex instructions, but building applications with less capable LLMs requires breaking a single, detailed instruction into a “chain” of simpler prompts. Here’s an overview of practically useful chaining techniques for LLMs... Some
9
92
495
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Reinforcement learning from human feedback (RLHF) has gained recent popularity due to its ability to refine and improve the behavior of large language models. Recently, this framework has been extended to improve the quality of video game AIs. Here’s how… 🧵[1/8]
4
70
487
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
The Falcon-7B/40B open-source LLMs were released late this week, and their performance is super impressive. But, there's a huge catch for those using them commercially! Here are my main takeaways from the models so far... Model architecture. The Falcon models were released by
18
68
481
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
The MPT suite of large language models (LLMs) by MosaicML has become incredibly popular. But, what makes these models so special? Although there are a variety of reasons for the popularity of MPT, I find these models to be especially useful due to a few unique components… Fully
10
96
479
@cwolferesearch
Cameron R. Wolfe, Ph.D.
11 months
LLaMA-2 outlines the remaining limitations of open-source language models well. Put simply, the gap in performance between open-source and proprietary LLMs is largely due to the quality of alignment. However, LLaMA-2 takes a major step in the right direction… State of
14
83
467
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
I have recently given some long-form lectures on language models, how they work, and the AI landscape, which has given me a chance to more clearly organize key concepts for understanding language models. Here are the 15 key concepts that I’ve arrived at so far… AI
11
81
455
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 month
Recently, I’ve run hundreds of instruction tuning experiments with LoRA/QLoRA, and I wanted to share some (basic) code and findings that might be useful… The code (see replies) contains an instruction tuning script using LoRA/QLoRA and the Alpaca dataset, as well as evaluation
22
78
461
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Following the release of LLaMA, we saw a rapid explosion of open-source research on large language models (LLMs). Here are the three most notable model releases during this time… 🧵 [1/8]
12
79
448
@cwolferesearch
Cameron R. Wolfe, Ph.D.
9 months
Almost all generative language models use a decoder-only transformer architecture, making the decoder-only transformer one of the most influential architectures in modern AI. Let’s take a deeper look at an implementation to understand exactly how it works… Implementation
6
60
442
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
The PaLM API was recently released (to select developers) by Google to compete with the ChatGPT API by OpenAI. Here are the five main things you need to know about PaLM… 🧵 [1/7]
14
67
436
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 month
Here is a (brief) taxonomy of the three advanced prompt engineering techniques that are most commonly used/referenced… Disclaimer: Basic prompting techniques (e.g., zero/few-shot or instruction prompting) are highly effective, but sometimes more complex prompts can be useful
8
103
443
@cwolferesearch
Cameron R. Wolfe, Ph.D.
2 years
The transformer is a foundational deep learning tool that is useful for a variety of tasks. One of the coolest applications of transformers (in my opinion) is for multi-object tracking in video. Here's how it works ... 🧵[1/7]
6
61
427
@cwolferesearch
Cameron R. Wolfe, Ph.D.
4 months
Having the ability to clearly explain fundamental concepts in AI to others is incredibly important. To explain large language models (LLMs), I use a simple three-part framework… Why is this important? Given that most AI engineers/researchers work on teams with highly-technical
6
81
433
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
BERT made transfer learning popular in NLP, but follow-up research proposed a ton of new techniques for transfer learning with large language models (LLMs). T5 analyzed these techniques using a unified format. Here’s what we learn from this… 🧵 [1/9]
7
73
424
@cwolferesearch
Cameron R. Wolfe, Ph.D.
11 months
Can generative models create their own training data? Recent research indicates that we should be careful with doing this! For image generation models especially, there seems to be a reasonable risk of degradation (or even a complete collapse) in performance… What is
20
91
420
@cwolferesearch
Cameron R. Wolfe, Ph.D.
2 months
Research on LLMs is moving quickly, and even models / techniques that have been state-of-the-art for a long time (e.g., GPT-4 and Mixtral) are being quickly dethroned. Here’s a list of my top ten AI developments (each with a brief summary) over the last few months… [1] DBRX is
9
104
419
@cwolferesearch
Cameron R. Wolfe, Ph.D.
10 months
The recent success of LLaMA-2, which can be attributed to a variety of factors, clearly demonstrates the massive value of reinforcement learning from human feedback (RLHF). Here’s what the authors of LLaMA have to say about why RLHF is so important… Collecting data for RLHF.
6
73
414
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Next-token prediction is the workhorse behind all modern advancements in large language models (LLMs) due to its use in training these models over unlabeled text. But, how exactly does this next-token prediction (or language modeling) objective work? Let’s take a deeper look…
11
68
406
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Large Language Models (LLMs) make awesome foundation models and can be re-purposed for solving a variety of tasks. But, how can we specialize generic LLMs to solve more domain-specific problems? Currently, there are three main approaches…🧵[1/8]
5
61
408
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Diffusion models (DMs) are SOTA for generative modeling of images and video, but their typical formulation requires hundreds of GPU days for training. Stable Diffusion fixed this. Here’s how… 🧵 [1/8]
7
49
405
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
The foundation series by MosaicML, including MPT-7B/30B (and an efficient training repo), makes high-quality pre-trained language models available to anyone for commercial use. Given that creating a pre-trained base model is incredibly expensive, these open-source tools enable a
16
66
400
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Recent research on language models has aimed to increase the maximum allowable context length of the underlying model. But, how can we enable an LLM to handle longer inputs? One way is through the use of ALiBi… Vanilla position embeddings. Decoder-only transformer architectures
6
60
400
@cwolferesearch
Cameron R. Wolfe, Ph.D.
8 months
Retrieval Augmented Generation (RAG) is a popular tool for improving the quality/factuality of LLMs. Self-RAG makes RAG smarter by teaching the LLM to reflect and decide which components of RAG actually help with answering a prompt… TL;DR: RAG is highly effective, but it’s a
7
70
392
@cwolferesearch
Cameron R. Wolfe, Ph.D.
3 months
There are a ton of different ways to finetune a language model. Here's a (brief) summary of language model finetuning, the various approaches that exist, their purpose, and what we know about how they work... Finetuning techniques: The term “finetuning” simply refers to further
7
88
396
@cwolferesearch
Cameron R. Wolfe, Ph.D.
4 months
Retrieval augmented generation (RAG) was proposed in 2020, but the idea has since been explored and expanded by a variety of papers. Here are four notable publications that study advanced concepts with RAG… (0) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks:
5
91
390
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Prompt engineering is oftentimes an annoying and brittle process. A small tweak to a prompt could massively change an LLM's output. But, it doesn’t have to be this way! We can adopt techniques like prompt ensembles to improve LLM reliability. 🧵 [1/10]
6
77
386
@cwolferesearch
Cameron R. Wolfe, Ph.D.
8 months
Most intro paragraphs for AI/ML papers just re-state the same, basic info about AI. But, the recent "GPT-4 Doesn’t Know It’s Wrong" paper has one of the best intros I've ever read... "Large Language Models (LLMs), essentially n-gram models on steroids which have been trained on
13
67
388
@cwolferesearch
Cameron R. Wolfe, Ph.D.
5 months
Most businesses are interested in training a specialized LLM on their own data. However, exposing proprietary data to an LLM is a security risk. Can we ensure that the LLM’s training data will not be leaked? Recent research indicates that the answer is no… TL;DR: Recent
10
82
379
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 month
LLMs are cool, but getting married was a lot cooler! Thank you everyone for not releasing any new models over the weekend. It was nice to fully disconnect and celebrate with my friends and family!
50
7
381
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Instruction fine-tuning (or instruction tuning for short) is an incredibly useful method for creating high-performing large language models (LLMs). Here are 3 key ideas you need to know about it…🧵[1/7]
4
85
373
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
One of the main benefits of GPT-4 relative to prior models (like ChatGPT/GPT-3.5) is that the model is incredibly steerable. Here’s what this means and how you can use it to create better chat experiences… 🧵[1/8]
11
66
369
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Large Language Models (LLMs) have the potential to be incredibly useful, but they also make a lot of mistakes (e.g., by generating false or biased information). To eliminate this behavior, recent generations of LLMs utilize a two-part refinement process… 🧵 [1/10]
7
56
363
@cwolferesearch
Cameron R. Wolfe, Ph.D.
6 months
I’ve spent the last ~5 years working on (and writing about) language models. The proposal of Google Gemini made me think about why I am so interested in these models. There are numerous reasons, but the allure of LLMs (at least for me) boils down to 3 core properties… TL;DR:
7
52
364
@cwolferesearch
Cameron R. Wolfe, Ph.D.
6 months
The mixture of pretraining data used for Gemini was excluded from the technical report. Data mixology truly seems to be the new black magic for building effective AI systems. But, Gemini does give us a few important data-related learnings... (1) Diverse sources: Whenever
9
68
360
@cwolferesearch
Cameron R. Wolfe, Ph.D.
3 months
Masked self-attention is the key building block that allows LLMs to learn rich relationships and patterns between the words of a sentence. Let’s build it together from scratch… The big picture: Large language models are based upon a deep neural network architecture called a
5
70
363
@cwolferesearch
Cameron R. Wolfe, Ph.D.
6 months
Looking for something to talk to your family about while you’re home for the holidays? Why not give them a clear, accessible explanation of ChatGPT? Here’s a simple, three-part framework that you can use to explain generative language models to (almost) anyone… TL;DR: We can
9
55
359
@cwolferesearch
Cameron R. Wolfe, Ph.D.
8 months
We’ve seen a massive amount of progress in AI/LLM research over the last several weeks. Here are the five highest-impact papers/projects that I’ve been focusing on recently… StreamingLLM solves limitations with LLMs generating long sequences of text. To avoid excessive memory
4
80
349
@cwolferesearch
Cameron R. Wolfe, Ph.D.
11 months
Recently proposed open-source language models have placed an emphasis upon inference speed. Such work has shown us that inference speed can be improved by up to 5X (or more) by making some changes to the decoder-only transformer architecture. Here are three examples that have
8
76
351
@cwolferesearch
Cameron R. Wolfe, Ph.D.
5 months
Recently-proposed large language models (LLMs) such as Google Gemini are structured and trained in a manner that maximizes efficiency and boosts training stability. But, what common tricks are used to achieve these efficiency/stability benefits? TL;DR: Making LLMs more
2
66
342
@cwolferesearch
Cameron R. Wolfe, Ph.D.
11 months
@LukeGessler Pretty cool. Reminds me of using JPEG directly as input for image recognition with neural nets. I bet there's a lot of cool tricks like this out there that we haven't found yet.
6
44
348
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Recent research in open-source LLMs has made paid APIs much less enticing (though not hosting your own model is still nice). So much is possible if we are willing to fine-tune on some task-specific data! Here are a few examples to support my point... 🧵[1/6]
8
55
345
@cwolferesearch
Cameron R. Wolfe, Ph.D.
9 months
Next token prediction is the workhorse of causal language models. Despite recent advancements, LLMs' capabilities are largely attributable to next token prediction. To better grasp this concept, let’s study an implementation of next token prediction for LLM pretraining...
7
63
330
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
The team at @CohereAI just released an awesome API endpoint (called Rerank) that can easily improve search and recommendation offerings by using LLMs. Here's what you need to know... Some background: Most search engines follow a two-step process. 1. Filtering: a rough/efficient
12
52
329
@cwolferesearch
Cameron R. Wolfe, Ph.D.
2 years
What's the best learning rate schedule to use for training a neural net? This is a simple question that pretty much any deep learning practitioner will ask. I argue that cyclical LR schedules are most practical. Here's why... 🧵 [1/6]
7
53
328
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
Plugins are now supported by ChatGPT. These plugins will basically provide an "App Store" or ecosystem of useful tools that integrate easily with ChatGPT. Here's why this is a massive opportunity for AI developers. Many successful companies have been built by offering apps or
5
55
327
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
We all know that LLMs tend to make errors, whether it be simple mistakes (e.g., improper arithmetic), hallucinations, or something else. But, studying the statistics of mistakes that LLMs make shows us something that we might not intuitively expect. Background: One way to study
8
51
323
@cwolferesearch
Cameron R. Wolfe, Ph.D.
1 year
I just found a super useful stable diffusion CLI that can efficiently perform image generation, masking, editing, outpainting, and more with no coding required. Here's some cool stuff you can do with it... 🧵[1/7]
5
59
325