Cameron R. Wolfe, Ph.D. @cwolferesearch Twitter profile | Pikagi

Cameron R. Wolfe, Ph.D.

@cwolferesearch

22,518

Followers

633

Following

709

Media

3,501

Statuses

ML @Netflix • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable

https://t.co/j75fAdLpp8

Joined August 2021

Pinned Tweet

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

Q-Learning is *probably* not the secret to unlocking AGI. But, combining synthetic data generation (RLAIF, self-instruct, etc.) and data efficient reinforcement learning algorithms is likely the key to advancing the current paradigm of AI research… TL;DR: Finetuning with

Tweet media one

47

453

2K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

@MerabDvalishvil

Tweet media one

14

27

2K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Large language models (LLMs) are fun to use, but understanding the fundamentals of how they work is also incredibly important. One major idea and building block of LLMs is their underlying architecture: the decoder-only transformer model. 🧵[1/6]

Tweet media one

42

386

2K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

Due to the recent surge in popularity of AI and language models, one of the most common questions I hear is: How can we train a specialized LLM over our own data? The answer is actually pretty simple… TL;DR: Training LLMs end-to-end is quite difficult due to the size of the

Tweet media one

23

327

2K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

One of the best ways to reduce hallucinations with LLMs is by retrieving useful, factual information and injecting it into the LLM’s prompt as added context. Although this might sound complicated, it’s actually quite easy to implement with standard vector search functionality…

Tweet media one

41

199

1K
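The retrieve-then-inject idea in the tweet above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the three-dimensional embeddings are made up by hand (a real system would obtain them from an embedding model), and `DOCS`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import math

# Toy document store: (text, embedding). The vectors are invented for
# illustration; in practice they would come from an embedding model.
DOCS = [
    ("The Eiffel Tower is in Paris.", [0.9, 0.1, 0.0]),
    ("Python is a programming language.", [0.0, 0.2, 0.9]),
    ("Paris is the capital of France.", [0.8, 0.3, 0.1]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, k=2):
    """Return the texts of the k documents most similar to the query vector."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, query_vec, k=2):
    """Inject the retrieved documents into the prompt as added context."""
    context = "\n".join(f"- {t}" for t in retrieve(query_vec, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("Where is the Eiffel Tower?", [1.0, 0.2, 0.0])
```

The resulting `prompt` string would then be sent to the LLM; only the retrieval and prompt-assembly steps are shown here.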

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The ChatGPT API was released yesterday and it costs 90% less than expected. Here are five methods (and resources to learn about them) that are **probably** being used to enable this price reduction… 🧵[1/6]

Tweet media one

27

267

1K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

4 months

RAG is one of the best (and easiest) ways to specialize an LLM over your own data, but successfully applying RAG in practice involves more than just stitching together pretrained models… What is RAG? At the highest level, RAG is a combination of a pretrained LLM with an

Tweet media one

19

267

1K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

The volume of LLM research being released is staggering. Although there are too many new papers for any one person to read, this work can be largely distilled into a much smaller set of overlapping themes. Recently, there are three trends in LLM research that have been especially

Tweet media one

30

277

1K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Self-attention is the primary building block of large language models (LLMs) and transformers in general. But, how exactly does it work? 🧵 [1/8]

Tweet media one

20

198

1K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Although large language models (LLMs) are incredibly capable, they are pretty simple to understand. In fact, the core components of most LLMs can be distilled into five major components… 🧵[1/7]

Tweet media one

27

209

1K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

Generative large language models (LLMs) are based upon the decoder-only transformer architecture. Currently, these types of generative LLMs are incredibly popular. However, I use encoder-only architectures for 90% of use cases as a practitioner. Here’s why… History of

Tweet media one

27

183

1K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Nearly all recently-proposed large language models (LLMs) are based upon the decoder-only transformer architecture. But, is this always the best architecture to use? It depends… 🧵 [1/8]

Tweet media one

24

200

1K

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

6 months

Want to train a specialized LLM on your own data? The easiest way to do this is with low rank adaptation (LoRA), but many variants of LoRA exist. Here’s an overview of all (or at least most) of the techniques that are out there… LoRA models the update derived for a model’s

Tweet media one

16

214

968
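The tweet above is cut off before it describes the update, so the following is only a sketch of the standard LoRA formulation: the pretrained weight `W` stays frozen and the update is modeled as a scaled low-rank product, `W + (alpha / r) * B @ A`. The dimensions, `alpha`, and the helper names (`matmul`, `lora_effective_weight`) are assumptions chosen for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(w, a, b, alpha, r):
    """Frozen weight plus the scaled low-rank update (alpha / r) * B @ A."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

random.seed(0)
d_out, d_in, r = 8, 8, 2  # illustrative sizes only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero init

full_params = d_out * d_in        # parameters touched by full finetuning
lora_params = r * (d_in + d_out)  # parameters trained by LoRA
W_eff = lora_effective_weight(W, A, B, alpha=16, r=r)
```

Because `B` starts at zero, the effective weight equals `W` before any training step, which is why adding the adapter does not change the model's initial behavior.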

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 months

LLaMA-3 is a prime example of why training a good LLM is almost entirely about data quality… TL;DR. Meta released LLaMA-3-8B/70B today and 95% of the technical info we have so far is related to data quality: - 15T tokens of pretraining data - More code during pretraining

Tweet media one

21

225

918

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Each “block” of a large language model (LLM) is comprised of self-attention and a feed-forward transformation. However, the exact self-attention variant used by LLMs is masked, multi-headed self-attention. Let’s break down what this means…🧵[1/11]

Tweet media one

9

158

889

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

After GPT-3 was proposed, a lot of research was done to find an even better language model. Initial attempts focused on just training larger models. Contrary to popular belief, however, there is more to creating a good language model than size… 🧵[1/8]

18

136

877

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

4 months

What’s the easiest way to specialize an LLM over your own data? Recent research has studied this problem in depth, and RAG is way more effective (and easier to implement) compared to extended pretraining or finetuning… Knowledge from pretraining. A lot of factual information is

Tweet media one

16

157

883

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Have you ever wondered why all language models use decoder-only architectures? It's partially because decoder-only models work great for next-token prediction. However, recent research has also analyzed the choice of architecture for language models in depth... Decoder-only

Tweet media one

9

106

805

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

I just finished writing a survey on the history of open-source LLM research, spanning from the early days (e.g., OPT and BLOOM) to recent models like LLaMA-2. Here are three takeaways that seem to have the biggest impact on LLM quality… Base models make all the difference.

Tweet media one

19

146

795

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Large Language Models (LLMs) are notoriously bad at solving reasoning-based tasks. However, we can drastically improve their reasoning performance using simple techniques that require no fine-tuning or task-specific verifiers. Here’s how…🧵[1/7]

Tweet media one

18

127

724

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 month

Prompt engineering is one of the most rapidly-evolving research topics in AI, but we can (roughly) group recent research on this topic into four categories… (1) Reasoning: Simple prompting techniques are effective for many problems, but more sophisticated strategies are

Tweet media one

12

170

724

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

3 months

New language models get released every day (Gemini-1.5, Gemma, Claude 3, potentially GPT-5, etc.), but one component of LLMs has remained constant over the last few years: the decoder-only transformer architecture. This architecture has five components… Why should we care?

Tweet media one

12

158

705

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Foundation models for language understanding (such as GPT-4) are becoming increasingly common and useful. But, what about other modalities? Today, Meta AI released the "Segment Anything" model, a foundation model for image segmentation... 🧵 [1/6]

Tweet media one

7

120

691

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

Given the popularity of retrieval augmented generation (RAG) for LLMs, one question I’m constantly asked is: What model should I use to embed my data for RAG? This question has a simple answer that I use for (almost) all applications… TL;DR: Sentence BERT (sBERT) is an

Tweet media one

24

116

685

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

Trying to create a language model that understands your own custom data? Here are techniques you can use to create a “specialized” LLM, ordered in terms of the amount of complexity/compute involved… TL;DR: When trying to solve problems with language models, we should start

Tweet media one

6

140

677

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

11 months

Given the current foundation model paradigm, I wonder if building/training models will become antiquated. Will future data scientists understand the details of optimization, architectures, etc.? ML may slowly be abstracted in favor of simpler (language model-based) solutions...

Tweet media one

22

95

648

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

I just wrote a long-form overview of RLHF, its origins/motivation, and the impact it has had on the generative AI movement. My conclusion? RLHF is (arguably) the key advancement that made modern generative LLMs possible. Here’s why… TL;DR: Prior to RLHF, we primarily relied upon

Tweet media one

13

142

646

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

The creators of FlashAttention (makes language model training much faster) just released another awesome efficiency tool—FlashDecoding—that can make LLM inference up to 8X faster on long sequences. Here’s how it works… Background reading. To understand most of this post, you

Tweet media one

8

127

639

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Recently, I’ve read and overviewed publications for nearly 20 different large language models (LLMs) from GPT to ChatGPT. Here’s what I learned… 🧵 [1/10]

21

96

631

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Traditionally, LLMs have struggled to solve complex problems that require reasoning. Chain of thought prompting has improved their abilities in this domain, but why stop there? Here are four prompting techniques for solving difficult, multi-step problems with LLMs… 🧵 [1/8]

Tweet media one

14

132

613

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Research on advanced prompting techniques for language models has extended chain of thought and tree of thought prompting to graph-structured reasoning processes. But, did you know that there are two versions of “graph of thought” prompting that have been proposed already? Some

Tweet media one

14

97

613

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Using a KV cache is one of the most commonly-used tricks for speeding up inference with LLMs. Here’s exactly how it works… Autoregressive decoding process. When we perform inference with an LLM, it follows an autoregressive decoding process. Put simply, this means that we i)

Tweet media one

7

95

586
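Since the tweet above is truncated, here is a rough sketch of the general KV-cache idea rather than the author's walkthrough: during autoregressive decoding, the keys and values of earlier tokens are stored so that each new step computes attention only for the newest query. The single-head setup without learned projections and the names `attend` and `KVCache` are simplifying assumptions.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Stores keys/values of past tokens so they are never recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append the newest token's key/value, then attend over the whole cache.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy token vectors standing in for per-token query/key/value projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(t, t, t) for t in tokens]

# Recomputing attention for the last token from scratch gives the same result.
recomputed_last = attend(tokens[-1], tokens, tokens)
```

The cached path does one attention row per step, whereas recomputing from scratch would rebuild every key and value for the full prefix at each step.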

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Large Language Models (LLMs) commonly use a “greedy decoding” strategy to generate their output, but what exactly does this mean? Here’s how this process works… 🧵 [1/10]

Tweet media one

12

118

571
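As a rough illustration of the greedy decoding strategy named in the tweet above: at each step the single most probable next token is selected and appended, until an end-of-sequence token appears. `TOY_MODEL` is a hypothetical lookup table standing in for a real language model, and its probabilities are invented.

```python
# Hypothetical toy "model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"ran": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def greedy_decode(start="<s>", max_tokens=10):
    """Repeatedly pick the highest-probability next token (argmax)."""
    tokens = [start]
    for _ in range(max_tokens):
        probs = TOY_MODEL[tokens[-1]]
        next_token = max(probs, key=probs.get)  # greedy choice
        tokens.append(next_token)
        if next_token == "</s>":
            break
    return tokens

sequence = greedy_decode()
```

Greedy decoding is deterministic: the same prompt always yields the same sequence, in contrast to sampling-based strategies.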

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

When we interact with language model APIs, such as the OpenAI API, we typically have to set a “temperature” parameter when obtaining output from the language model. But, what exactly is this parameter and how does it work? Let’s take a deeper look… The decoding process:

Tweet media one

14

106

568
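The tweet above is cut off, but the usual role of temperature can be sketched as follows: the model's logits are divided by the temperature before the softmax, so lower values sharpen the next-token distribution and higher values flatten it. The function name and example logits are assumptions for illustration, not any provider's API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up next-token logits
sharp = softmax_with_temperature(logits, temperature=0.5)
default = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

Sampling from `sharp` would almost always return the top token, while sampling from `flat` produces more varied output.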

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Prompt engineering for language models usually involves tweaking the wording or structure of a prompt. But, recent research has explored automated prompt engineering via continuous updates (e.g., via SGD) to a prompt’s embedding. Here’s how these techniques work… 🧵 [1/8]

Tweet media one

13

106

560

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Given the incredible performance of large language models (LLMs) like ChatGPT, it’s hard to believe that the original generative pre-trained transformer (GPT) was proposed less than five years ago. Here’s how we got to where we are right now… 🧵[1/8]

Tweet media one

9

96

550

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Many different (text-based) transformer architectures exist, but when and where should we use them? Here’s a quick list of four important transformer variants and the best applications to use them for…🧵[1/7]

Tweet media one

9

115

558

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Large language models (LLMs) have been criticized due to their heavy reliance on humans to create datasets for fine-tuning and RLHF, but recent research suggests that we might not even need humans for this… 🧵[1/9]

9

71

556

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Reinforcement learning from human feedback (RLHF) can teach LLMs a variety of interesting skills. As an example, Sparrow, a chatbot developed by @DeepMind , is taught (via RLHF) to support its factual claims by finding relevant information on Google... 🧵 [1/7]

3

78

555

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Vision Transformers (ViTs) are a powerful deep learning architecture, but what’s the difference between ViT and a text-based transformer like BERT? Despite being applied in completely different domains, these models have only one major difference… 🧵[1/7]

Tweet media one

4

101

546

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Advanced prompting techniques allow language models to solve complex problems but are often constrained to a single line of reasoning. Tree of thoughts (ToT) prompting avoids this by deliberately decomposing, planning, and exploring candidate solutions to a problem via a

Tweet media one

14

95

551

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Foundation models are a popular topic in AI research. However, task-specific fine-tuning outperforms zero/few-shot learning with foundation models in most cases. Specialized models are hard to beat! Luckily, recent research indicates that we can combine the strengths of both

Tweet media one

11

114

544

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

7 months

Reinforcement learning from human feedback (RLHF) is a major catalyst of the recent generative AI boom, as it enables language models to surpass human writing quality. RLHF makes this possible by improving the alignment process in three main ways... What is RLHF? RLHF is a

Tweet media one

13

102

541

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

The impressive in-context learning abilities of LLMs have created the need for larger context windows. Recently, researchers discovered that we can easily extend the context window of a pretrained LLM with one simple trick (and no extra training)… What is the context window?

Tweet media one

9

106

542

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Most high-performing large language models (LLMs) are closed-source and can only be accessed via paid APIs. However, the public release of LLaMA has recently challenged this trend. Here’s what you need to know about LLaMA… 🧵[1/7]

Tweet media one

9

100

539

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

3 months

Retrieval-augmented generation (RAG) is the best way to specialize an LLM over your own data. Researchers have recently discovered a finetuning approach that makes LLMs much better at RAG... RAFT and specializing LLMs. Most use cases with LLMs require specializing the model to

Tweet media one

8

121

536

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

This is huge news! The number of times I've been asked "How difficult would it be to create a ChatGPT for <insert domain>?" is nearly countless. I'm sure versions of ChatGPT for retail, banking, insurance, and more will soon be available. [1/3]

8

87

530

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Object detection is a fundamental problem in computer vision. Although Vision Transformers (ViTs) achieve state-of-the-art performance today, the history of object detection proceeded in three distinct generations of innovation… 🧵 [1/7]

Tweet media one

11

100

531

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

In the wake of LLaMA, the deep learning research community quickly adopted the view that open-source LLMs will rule the future—reproducing open-source variants of proprietary models seemed to be easy and cheap. Is this the truth? Here’s a brief timeline of model proposals and

Tweet media one

17

101

526

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Can large language models (LLMs) train themselves? The explosion of imitation-based open-source LLMs drew criticism due to cursory evaluation that covered up performance gaps. However, recent research shows powerful open-source LLMs can actually be created by imitating other

Tweet media one

11

101

525

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Reinforcement Learning from Human Feedback (RLHF) is a valuable fine-tuning technique, but people often misunderstand how it works and the impact that it has on LLM behavior. Meta's LIMA publication provides a lot of information that puts the value of RLHF into perspective...

Tweet media one

18

107

510

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

3 months

Now that Grok-1 has been released, it’s the perfect time to brush up on how Mixture-of-Experts (MoE) layers work in LLMs. Here’s a quick explainer… TL;DR: When applied to transformer models, MoE layers have two primary components: - Sparse MoE Layer: replaces dense

Tweet media one

9

110

503
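The tweet above is truncated mid-list, so the following is only a generic sketch of how a sparse MoE layer is commonly wired: a router scores every expert for a token, the top-k experts are kept, and their outputs are combined with softmax-normalized gate weights. The expert functions, router logits, and the names `top_k_route` and `moe_forward` are all hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for i, gate in gates.items():
        y = experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Toy experts standing in for separate feed-forward networks.
experts = [
    lambda x: [2 * v for v in x],
    lambda x: [v + 1 for v in x],
    lambda x: [-v for v in x],
    lambda x: list(x),
]
router_logits = [0.1, 3.0, -1.0, 3.0]  # made-up router scores for one token
gates = top_k_route(router_logits, k=2)
output = moe_forward([1.0, 2.0], experts, router_logits, k=2)
```

Only two of the four experts run for this token, which is the source of the compute savings relative to a dense layer with the same total parameter count.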

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Is Attention really all we need? The answer seems to be yes, but why is this the case? Here are the two main problems that transformers solved, which enabled many of the breakthroughs in natural language processing that we see today… 🧵[1/6]

Tweet media one

8

55

498

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

Powerful LLMs like GPT-4 can follow complex instructions, but building applications with less capable LLMs requires breaking a single, detailed instruction into a “chain” of simpler prompts. Here’s an overview of practically useful chaining techniques for LLMs... Some

Tweet media one

9

92

495

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Reinforcement learning from human feedback (RLHF) has gained recent popularity due to its ability to refine and improve the behavior of large language models. Recently, this framework has been extended to improve the quality of video game AIs. Here’s how… 🧵[1/8]

4

70

487

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The Falcon-7B/40B open-source LLMs were released late this week, and their performance is super impressive. But, there's a huge catch for those using them commercially! Here are my main takeaways from the models so far... Model architecture. The Falcon models were released by

Tweet media one

18

68

481

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The MPT suite of large language models (LLMs) by MosaicML has become incredibly popular. But, what makes these models so special? Although there are a variety of reasons for the popularity of MPT, I find these models to be especially useful due to a few unique components… Fully

Tweet media one

10

96

479

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

11 months

LLaMA-2 outlines the remaining limitations of open-source language models well. Put simply, the gap in performance between open-source and proprietary LLMs is largely due to the quality of alignment. However, LLaMA-2 takes a major step in the right direction… State of

Tweet media one

14

83

467

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

I have recently given some long-form lectures on language models, how they work, and the AI landscape, which has given me a chance to more clearly organize key concepts for understanding language models. Here are the 15 key concepts that I’ve arrived at so far… AI

Tweet media one

11

81

455

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 month

Recently, I’ve run hundreds of instruction tuning experiments with LoRA/QLoRA, and I wanted to share some (basic) code and findings that might be useful… The code (see replies) contains an instruction tuning script using LoRA/QLoRA and the Alpaca dataset, as well as evaluation

Tweet media one

22

78

461

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Following the release of LLaMA, we saw a rapid explosion of open-source research on large language models (LLMs). Here are the three most notable model releases during this time… 🧵 [1/8]

Tweet media one

12

79

448

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

9 months

Almost all generative language models use a decoder-only transformer architecture, making the decoder-only transformer one of the most influential architectures in modern AI. Let’s take a deeper look at an implementation to understand exactly how it works… Implementation

Tweet media one

6

60

442

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The PaLM API was recently released (to select developers) by Google to compete with the ChatGPT API by OpenAI. Here are the five main things you need to know about PaLM… 🧵 [1/7]

Tweet media one

14

67

436

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 month

Here is a (brief) taxonomy of the three advanced prompt engineering techniques that are most commonly used/referenced… Disclaimer: Basic prompting techniques (e.g., zero/few-shot or instruction prompting) are highly effective, but sometimes more complex prompts can be useful

Tweet media one

Tweet media two

Tweet media three

Tweet media four

8

103

443

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

The transformer is a foundational deep learning tool that is useful for a variety of tasks. One of the coolest applications of transformers (in my opinion) is for multi-object tracking in video. Here's how it works ... 🧵[1/7]

6

61

427

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

4 months

Having the ability to clearly explain fundamental concepts in AI to others is incredibly important. To explain large language models (LLMs), I use a simple three-part framework… Why is this important? Given that most AI engineers/researchers work on teams with highly-technical

6

81

433

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

BERT made transfer learning popular in NLP, but follow-up research proposed a ton of new techniques for transfer learning with large language models (LLMs). T5 analyzed these techniques using a unified format. Here’s what we learn from this… 🧵 [1/9]

7

73

424

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

11 months

Can generative models create their own training data? Recent research indicates that we should be careful with doing this! For image generation models especially, there seems to be a reasonable risk of degradation (or even a complete collapse) in performance… What is

Tweet media one

20

91

420

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 months

Research on LLMs is moving quickly, and even models / techniques that have been state-of-the-art for a long time (e.g., GPT-4 and Mixtral) are being quickly dethroned. Here’s a list of my top ten AI developments (each with a brief summary) over the last few months… [1] DBRX is

Tweet media one

Tweet media two

Tweet media three

Tweet media four

9

104

419

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

10 months

The recent success of LLaMA-2, which can be attributed to a variety of factors, clearly demonstrates the massive value of reinforcement learning from human feedback (RLHF). Here’s what the authors of LLaMA have to say about why RLHF is so important… Collecting data for RLHF.

Tweet media one

6

73

414

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Next-token prediction is the workhorse behind all modern advancements in large language models (LLMs) due to its use in training these models over unlabeled text. But, how exactly does this next-token prediction (or language modeling) objective work? Let’s take a deeper look…

Tweet media one

11

68

406
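One way to picture the next-token prediction objective mentioned in the tweet above: for every position in a text, the model predicts a distribution over the next token, and the loss is the average negative log-probability assigned to the token that actually follows. The `uniform_model` below is a hypothetical stand-in for a real language model.

```python
import math

def next_token_loss(tokens, predict):
    """Average cross-entropy of predicting tokens[t + 1] from tokens[: t + 1]."""
    losses = []
    for t in range(len(tokens) - 1):
        probs = predict(tokens[: t + 1])  # distribution over the vocabulary
        losses.append(-math.log(probs[tokens[t + 1]]))
    return sum(losses) / len(losses)

VOCAB = ["a", "b", "c", "d"]

def uniform_model(prefix):
    """Hypothetical untrained model: every token is equally likely."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

loss = next_token_loss(["a", "b", "c"], uniform_model)
```

No labels are needed: the targets are simply the text shifted by one position, which is why this objective works over raw, unlabeled text.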

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Large Language Models (LLMs) make awesome foundation models and can be re-purposed for solving a variety of tasks. But, how can we specialize generic LLMs to solve more domain-specific problems? Currently, there are three main approaches…🧵[1/8]

5

61

408

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Diffusion models (DMs) are SOTA for generative modeling of images and video, but their typical formulation requires hundreds of GPU days for training. Stable Diffusion fixed this. Here’s how… 🧵 [1/8]

7

49

405

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The foundation series by MosaicML, including MPT-7B/30B (and an efficient training repo), makes high-quality pre-trained language models available to anyone for commercial use. Given that creating a pre-trained base model is incredibly expensive, these open-source tools enable a

Tweet media one

16

66

400

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Recent research on language models has aimed to increase the maximum allowable context length of the underlying model. But, how can we enable an LLM to handle longer inputs? One way is through the use of ALiBi… Vanilla position embeddings. Decoder-only transformer architectures

Tweet media one

6

60

400
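The tweet above is cut off before it explains ALiBi, so this is a sketch of the technique as generally described: instead of adding position embeddings to the input, a linear penalty proportional to the query-key distance is added to each attention score, with a fixed slope per head. The helper names are hypothetical, and the slope formula shown is the geometric sequence used for power-of-two head counts.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: 2^(-8/n), 2^(-16/n), ... (power-of-two head counts)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * (i - j) for keys j <= query i.

    Future positions (j > i) are set to -inf, as a causal mask would do.
    """
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(3, slopes[0])
```

Because the penalty depends only on relative distance, the same rule applies at any sequence length, which is what allows inputs longer than those seen in training.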

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

Retrieval Augmented Generation (RAG) is a popular tool for improving the quality/factuality of LLMs. Self-RAG makes RAG smarter by teaching the LLM to reflect and decide which components of RAG actually help with answering a prompt… TL;DR: RAG is highly effective, but it’s a

Tweet media one

7

70

392

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

3 months

There are a ton of different ways to finetune a language model. Here's a (brief) summary of language model finetuning, the various approaches that exist, their purpose, and what we know about how they work... Finetuning techniques: The term “finetuning” simply refers to further

Tweet media one

7

88

396

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

4 months

Retrieval augmented generation (RAG) was proposed in 2020, but the idea has since been explored and expanded by a variety of papers. Here are four notable publications that study advanced concepts with RAG… (0) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks:

Tweet media one

5

91

390

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Prompt engineering is oftentimes an annoying and brittle process. A small tweak to a prompt could massively change an LLM's output. But, it doesn’t have to be this way! We can adopt techniques like prompt ensembles to improve LLM reliability. 🧵 [1/10]

Tweet media one

6

77

386

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

Most intro paragraphs for AI/ML papers just re-state the same, basic info about AI. But, the recent "GPT-4 Doesn’t Know It’s Wrong" paper has one of the best intros I've ever read... "Large Language Models (LLMs), essentially n-gram models on steroids which have been trained on

Tweet media one

13

67

388

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

Most businesses are interested in training a specialized LLM on their own data. However, exposing proprietary data to an LLM is a security risk. Can we ensure that the LLM’s training data will not be leaked? Recent research indicates that the answer is no… TL;DR: Recent

Tweet media one

10

82

379

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 month

LLMs are cool, but getting married was a lot cooler! Thank you everyone for not releasing any new models over the weekend. It was nice to fully disconnect and celebrate with my friends and family!

Tweet media one

Tweet media two

Tweet media three

Tweet media four

50

7

381

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Instruction fine-tuning (or instruction tuning for short) is an incredibly useful method for creating high-performing large language models (LLMs). Here are 3 key ideas you need to know about it…🧵[1/7]

Tweet media one

4

85

373

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

One of the main benefits of GPT-4 relative to prior models (like ChatGPT/GPT-3.5) is that the model is incredibly steerable. Here’s what this means and how you can use it to create better chat experiences… 🧵[1/8]

Tweet media one

11

66

369

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Large Language Models (LLMs) have the potential to be incredibly useful, but they also make a lot of mistakes (e.g., by generating false or biased information). To eliminate this behavior, recent generations of LLMs utilize a two-part refinement process… 🧵 [1/10]

7

56

363

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

6 months

I’ve spent the last ~5 years working on (and writing about) language models. The proposal of Google Gemini made me think about why I am so interested in these models. There are numerous reasons, but the allure of LLMs (at least for me) boils down to 3 core properties… TL;DR:

Tweet media one

7

52

364

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

6 months

The mixture of pretraining data used for Gemini was excluded from the technical report. Data mixology truly seems to be the new black magic for building effective AI systems. But, Gemini does give us a few important data-related learnings... (1) Diverse sources: Whenever

Tweet media one

9

68

360

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

3 months

Masked self-attention is the key building block that allows LLMs to learn rich relationships and patterns between the words of a sentence. Let’s build it together from scratch… The big picture: Large language models are based upon a deep neural network architecture called a

Tweet media one

5

70

363
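In the spirit of the "from scratch" framing in the tweet above (whose body is truncated), here is a minimal single-head sketch of masked self-attention. It assumes the query, key, and value vectors are already given, omits the learned projection matrices, and uses hypothetical helper names.

```python
import math

def softmax(row):
    """Numerically stable softmax; -inf entries receive zero weight."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_self_attention(q, k, v):
    """Causal attention: token i may only attend to tokens j <= i."""
    n, d = len(q), len(q[0])
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            if j > i:
                scores.append(float("-inf"))  # mask out future tokens
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / math.sqrt(d))
        weights.append(softmax(scores))
    outputs = [[sum(weights[i][j] * v[j][c] for j in range(n))
                for c in range(len(v[0]))] for i in range(n)]
    return outputs, weights

# Toy token vectors reused as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = masked_self_attention(x, x, x)
```

The first token can only attend to itself, so its output equals its own value vector; later tokens mix information from everything before them.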

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

6 months

Looking for something to talk to your family about while you’re home for the holidays? Why not give them a clear, accessible explanation of ChatGPT? Here’s a simple, three-part framework that you can use to explain generative language models to (almost) anyone… TL;DR: We can

9

55

359

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

8 months

We’ve seen a massive amount of progress in AI/LLM research over the last several weeks. Here are the five highest-impact papers/projects that I’ve been focusing on recently… StreamingLLM solves limitations with LLMs generating long sequences of text. To avoid excessive memory

4

80

349

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

11 months

Recently proposed open-source language models have placed an emphasis upon inference speed. Such work has shown us that inference speed can be improved by up to 5X (or more) by making some changes to the decoder-only transformer architecture. Here are three examples that have

8

76

351

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

5 months

Recently proposed large language models (LLMs) such as Google Gemini are structured and trained in a manner that maximizes efficiency and boosts training stability. But, what common tricks are used to achieve these efficiency/stability benefits? TL;DR: Making LLMs more

2

66

342

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

11 months

@LukeGessler Pretty cool. Reminds me of using JPEG directly as input for image recognition with neural nets. I bet there's a lot of cool tricks like this out there that we haven't found yet.

Faster Neural Networks Straight from JPEG | Uber Blog

Uber AI Labs introduces a method for making neural networks that process images faster and more accurately by leveraging JPEG representations.

6

44

348

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Recent research in open-source LLMs has made paid APIs much less enticing (though not hosting your own model is still nice). So much is possible if we are willing to fine-tune on some task-specific data! Here are a few examples to support my point... 🧵[1/6]

8

55

345

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

9 months

Next token prediction is the workhorse of causal language models. Despite recent advancements, LLMs' capabilities are largely attributable to next token prediction. To better grasp this concept, let’s study an implementation of next token prediction for LLM pretraining...

7

63

330
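
The implementation referenced in the post above lives in an image that is not part of this page, so what follows is a separate, minimal sketch of the next token prediction objective under simple assumptions. It assumes a model has already produced one row of logits per input position, and it computes the average cross-entropy between the prediction at position t and the actual token at position t + 1. The function name `next_token_loss` is hypothetical.

```python
import math


def next_token_loss(logits, token_ids):
    """Average cross-entropy for next token prediction.

    logits: T rows, each a list of vocab-size scores for one position.
    token_ids: the T input token ids.
    The row at position t is scored against token t + 1, so the last
    position has no target and is skipped.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        row = logits[t]
        target = token_ids[t + 1]
        # Stable log-sum-exp gives the log of the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        # Negative log-probability assigned to the true next token.
        losses.append(log_z - row[target])
    return sum(losses) / len(losses)


# With uniform logits over a 4-token vocabulary, each next token gets
# probability 1/4, so the loss equals log(4).
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = next_token_loss(uniform_logits, [2, 0, 3])
```

A model that has learned nothing sits at log(vocabulary size); pretraining pushes this number down by raising the probability assigned to each true next token.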

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

The team at @CohereAI just released an awesome API endpoint (called Rerank) that can easily improve search and recommendation offerings by using LLMs. Here's what you need to know... Some background: Most search engines follow a two-step process. 1. Filtering: a rough/efficient

12

52

329

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 years

What's the best learning rate schedule to use for training a neural net? This is a simple question that pretty much any deep learning practitioner will ask. I argue that cyclical LR schedules are most practical. Here's why... 🧵 [1/6]

7

53

328
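
The thread above is not reproduced on this page, so the specific schedule it recommends is unknown. As a rough illustration of what "cyclical" means, here is a sketch of a triangular schedule, one common cyclical form, in which the learning rate ramps linearly from a lower bound to an upper bound and back again. The function name `triangular_lr` and the parameter values are illustrative assumptions.

```python
def triangular_lr(step, base_lr, max_lr, half_cycle):
    """Triangular cyclical learning rate.

    The rate climbs linearly from base_lr to max_lr over half_cycle
    steps, descends back to base_lr over the next half_cycle steps,
    and then repeats.
    """
    cycle_pos = step % (2 * half_cycle)
    if cycle_pos <= half_cycle:
        frac = cycle_pos / half_cycle  # rising half of the cycle
    else:
        frac = (2 * half_cycle - cycle_pos) / half_cycle  # falling half
    return base_lr + (max_lr - base_lr) * frac


# One full cycle of 20 steps between illustrative bounds.
schedule = [triangular_lr(s, base_lr=0.1, max_lr=1.0, half_cycle=10) for s in range(21)]
```

Only three values need choosing here (the two bounds and the cycle length), which is part of why schedules of this kind are often considered easy to tune.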

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

Plugins are now supported by ChatGPT. These plugins will basically provide an "App Store" or ecosystem of useful tools that integrate easily with ChatGPT. Here's why this is a massive opportunity for AI developers. Many successful companies have been built by offering apps or

5

55

327

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

We all know that LLMs tend to make errors, whether it be simple mistakes (e.g., improper arithmetic), hallucinations, or something else. But, studying the statistics of mistakes that LLMs make shows us something that we might not intuitively expect. Background: One way to study

8

51

323

@cwolferesearch

Cameron R. Wolfe, Ph.D.

@cwolferesearch

1 year

I just found a super useful stable diffusion CLI that can efficiently perform image generation, masking, editing, outpainting, and more with no coding required. Here's some cool stuff you can do with it... 🧵[1/7]

5

59

325