Hao Liu Profile
Hao Liu

@haoliuhl

4,204
Followers
156
Following
98
Media
267
Statuses

phd student @berkeley_ai machine learning, neural networks.

Joined September 2018
Pinned Tweet
@haoliuhl
Hao Liu
3 months
We are excited to share Large World Model (LWM), a general-purpose 1M context multimodal autoregressive model. It is trained on a large dataset of diverse long videos and books using RingAttention, and can perform language, image, and video understanding and generation.
Tweet media one
Tweet media two
Tweet media three
25
263
1K
@haoliuhl
Hao Liu
1 year
As part of our effort to replicate LLaMA in an open-source manner, we are pleased to announce the release of a preview of the 7B OpenLLaMA model, which has been trained with 200 billion tokens on the RedPajama dataset.
32
403
2K
@haoliuhl
Hao Liu
8 months
New paper w/ @matei_zaharia @pabbeel on transformers with large context size. We propose RingAttention, which allows training sequences that are device-count times longer than the prior state of the art, without attention approximations or incurring additional overhead.
Tweet media one
Tweet media two
10
182
854
@haoliuhl
Hao Liu
1 year
Humans learn from rich feedback in the form of language. Why not turn all feedback into a sentence to train models? We propose CoH: just tell models which outputs are not good and which are better. Better than SFT and RLHF on summarization and dialogue tasks.
Tweet media one
13
120
637
@haoliuhl
Hao Liu
1 year
1/ Excited to share our new paper with @pabbeel on long context models! 📚✍️ Check it out here: Training 7B models with over 130K or 13B models with over 64K context windows on just 8 A100 GPUs! 😮🖥️ Curious how we did it?
Tweet media one
Tweet media two
Tweet media three
8
107
596
@haoliuhl
Hao Liu
4 years
Excited to share our new work that explores the relationship between contrastive learning, discriminative modeling & generative modeling, through the lens of energy-based models. 🎓 💻 w/ @pabbeel summary thread: [1/N]
Tweet media one
5
88
409
@haoliuhl
Hao Liu
1 year
We introduce an unsupervised method to align text and image. Language Quantized AutoEncoders (LQAE) enables few-shot image classification with GPT3 and linear classification of images based on RoBERTa text features. paper: code:
Tweet media one
4
82
396
@haoliuhl
Hao Liu
2 years
Can language model pretraining be even better? Our paper shows that by randomly masking input tokens during pretraining, the zero-shot, few-shot, and fine-tuning performance can be significantly improved. 🧵
Tweet media one
2
41
302
@haoliuhl
Hao Liu
2 years
Excited to share M3AE, a simple but effective model for multimodal representation learning. TLDR: M3AE learns a unified encoder for both vision and language from both paired image-text data as well as unpaired data. w/ @YoungGeng Summary thread: [1/N]
Tweet media one
5
38
249
@haoliuhl
Hao Liu
3 years
A new preprint “Behavior From the Void: Unsupervised Active Pre-Training”. w/ @pabbeel TLDR: A simple yet effective method for reward-free unsupervised pre-training in RL via particle-based entropy maximization. Here is a summary thread👇
Tweet media one
2
52
231
@haoliuhl
Hao Liu
7 months
RingAttention's Jax code is available at . In end-to-end FSDP training on GPU (7B params, 8x A100 80G), context expands from 32K to 256K tokens and can reach 16M tokens with 512x A100. On TPU (7B params, 1024x TPUv4, FSDP), context can reach 8M tokens.
@haoliuhl
Hao Liu
8 months
New paper w/ @matei_zaharia @pabbeel on transformers with large context size. We propose RingAttention, which allows training sequences that are device-count times longer than the prior state of the art, without attention approximations or incurring additional overhead.
Tweet media one
Tweet media two
10
182
854
3
39
182
@haoliuhl
Hao Liu
1 year
With ChatGPT's mind-blowing results, the ML community is getting more curious about RLHF. RLHF outperforms supervised finetuning (SFT), as shown in InstructGPT. But RLHF uses an extra large dataset in step 2. Thus, a missing baseline is SFT on both datasets from steps 1 and 2. [1/2]
Tweet media one
3
19
176
@haoliuhl
Hao Liu
1 year
The code of blockwise parallel transformer is now available.
@haoliuhl
Hao Liu
1 year
1/ Excited to share our new paper with @pabbeel on long context models! 📚✍️ Check it out here: Training 7B models with over 130K or 13B models with over 64K context windows on just 8 A100 GPUs! 😮🖥️ Curious how we did it?
Tweet media one
Tweet media two
Tweet media three
8
107
596
1
21
110
@haoliuhl
Hao Liu
8 months
The possibility of very large context introduces exciting opportunities, such as video-audio-language model, learning from extended feedback or trial-and-error, and AI for science data like gene sequence. Paper link: Code link: coming soon
5
6
100
@haoliuhl
Hao Liu
1 year
How should we pretrain large language-vision models to help with seeing, acting, and following instructions? We found that models jointly pretrained on image-text pairs and a text-only corpus significantly outperform baselines. A 🧵 on the paper InstructRL
3
13
88
@haoliuhl
Hao Liu
10 months
At #ICML2023 to present Agentic Transformer and Blockwise Parallel Transformer, and will be hanging out at poster sessions. Pls reach out if you'd like to chat about ML etc.
Tweet media one
2
13
87
@haoliuhl
Hao Liu
1 year
In our #NeurIPS2022 work, we explore the generality of masked token prediction for generalizable and flexible reinforcement learning. A 🧵 on the paper
Tweet media one
3
21
83
@haoliuhl
Hao Liu
7 months
To echo Yann on the importance of crowd-sourced human feedback datasets: the absence of such high-quality datasets appears to have become a research bottleneck. A thread:
@ylecun
Yann LeCun
7 months
Human feedback for open source LLMs needs to be crowd-sourced, Wikipedia style. It is the only way for LLMs to become the repository of all human knowledge and cultures. Who wants to build the platform for this?
212
353
2K
1
9
81
@haoliuhl
Hao Liu
8 months
RingAttention lets you scale context length linearly with device count, breaking free from memory constraints. If you can train at 4K length on 8 GPUs, with RingAttention you can train at least 32K length with nearly zero overhead.
Tweet media one
1
8
77
@haoliuhl
Hao Liu
3 months
We open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens, along with the codebase for training and inference. The models are available at .
1
9
62
@haoliuhl
Hao Liu
1 year
Does interactive learning help developing better perception than learning from static datasets? In our #NeurIPS2022 paper, we propose a method based on unsupervised RL that matches SOTA SSL methods, without using data augmentation. A 🧵 on the paper:
Tweet media one
2
10
56
@haoliuhl
Hao Liu
8 months
We applied RingAttention to finetune a 512K context chatbot on conversations. On the long-range line retrieval task, GPT3.5-turbo-16K and Claude-2-100K demonstrate competitive accuracy within short context lengths. However, they cannot handle extended context lengths.
Tweet media one
1
3
46
@haoliuhl
Hao Liu
8 months
We use the original Transformer's architecture but rearrange the computation. In a ring of devices, each device stores one query block, while key-value blocks rotate through the devices for computing attention and feedforward.
Tweet media one
1
2
45
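To make the rotation above concrete, here is a minimal JAX sketch of the idea (illustrative only, not the released RingAttention code): each device keeps its query block, key/value blocks travel around the ring via jax.lax.ppermute, and attention is accumulated blockwise with an online softmax. Causal masking and the communication/compute overlap are omitted, and the function and shape names are assumptions.

```python
import jax
import jax.numpy as jnp

N_DEV = jax.device_count()                               # size of the ring
PERM = [(i, (i + 1) % N_DEV) for i in range(N_DEV)]      # send to the next host

def ring_attention(q, k, v):
    """q, k, v: this device's blocks, shape [block_len, head_dim]."""
    def step(carry, _):
        num, den, m, k_blk, v_blk = carry
        s = q @ k_blk.T / jnp.sqrt(q.shape[-1])           # block scores
        m_new = jnp.maximum(m, s.max(-1, keepdims=True))  # running max
        scale = jnp.exp(m - m_new)                        # rescale old stats
        p = jnp.exp(s - m_new)
        num = num * scale + p @ v_blk                     # value-weighted numerator
        den = den * scale + p.sum(-1, keepdims=True)      # softmax denominator
        # pass our key/value block to the next host, receive from the previous
        k_blk = jax.lax.ppermute(k_blk, "ring", PERM)
        v_blk = jax.lax.ppermute(v_blk, "ring", PERM)
        return (num, den, m_new, k_blk, v_blk), None

    init = (jnp.zeros_like(q),
            jnp.zeros(q.shape[:-1] + (1,)),
            jnp.full(q.shape[:-1] + (1,), -jnp.inf),
            k, v)
    (num, den, *_), _ = jax.lax.scan(step, init, None, length=N_DEV)
    return num / den

# one query/key/value block per device
attn = jax.pmap(ring_attention, axis_name="ring")
```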
@haoliuhl
Hao Liu
1 year
We evaluated OpenLLaMA using lm-evaluation-harness from @AiEleuther . Compared with the original LLaMA (1T tokens) and GPT-J (500B tokens), OpenLLaMA (200B tokens) exhibits comparable performance across a majority of tasks, and outperforms them on some.
1
3
44
@haoliuhl
Hao Liu
1 year
We are currently focused on completing the training process on the entire RedPajama dataset. This should give us an apples-to-apples comparison between the original LLaMA and our OpenLLaMA. Please stay tuned for when this will be available!
2
1
44
@haoliuhl
Hao Liu
1 year
Motivated by examining how LLaMA's data curation contributes to its exceptional performance and by creating a fully open-source version of LLaMA, we decided to replicate LLaMA with the same training hyperparameters and model configuration as the original.
1
3
40
@haoliuhl
Hao Liu
1 year
We train OpenLLaMA on cloud TPU-v4 pod using data parallelism and FSDP/Zero3 to balance throughput and memory usage. Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run. The training loss can be seen in the figure below.
Tweet media one
2
4
39
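For readers curious what "data parallelism and FSDP/Zero3" looks like in JAX terms, here is a hedged sketch using jax.sharding; the mesh shape, array sizes, and train_step are illustrative assumptions, not the actual OpenLLaMA training setup.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# assume an even device count: one data-parallel axis times one FSDP axis
devices = np.array(jax.devices()).reshape(2, -1)
mesh = Mesh(devices, axis_names=("dp", "fsdp"))

# Zero3-style: shard parameters over 'fsdp', replicate over 'dp';
# shard the batch over both axes
param_sharding = NamedSharding(mesh, P("fsdp"))
batch_sharding = NamedSharding(mesh, P(("dp", "fsdp")))

params = jax.device_put(jnp.zeros((8192, 4096)), param_sharding)
batch = jax.device_put(jnp.zeros((32, 2048), dtype=jnp.int32), batch_sharding)

@jax.jit
def train_step(params, batch):
    # toy forward pass (embedding lookup); XLA's GSPMD partitioner inserts the
    # all-gathers / reduce-scatters implied by the shardings. The real model,
    # loss, and optimizer step are omitted.
    return jnp.mean(params[batch])
```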
@haoliuhl
Hao Liu
1 year
We train OpenLLaMA on the RedPajama dataset curated by @togethercompute , which is an open reproduction of the LLaMA dataset containing 1.2 trillion tokens, roughly matching the number of tokens used by LLaMA. You can find more details in Together's blog .
1
4
38
@haoliuhl
Hao Liu
1 year
Many thanks to @_akhaliq for sharing our arxiv paper :)
@_akhaliq
AK
1 year
Blockwise Parallel Transformer for Long Context Large Models present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences
Tweet media one
0
20
103
2
2
38
@haoliuhl
Hao Liu
8 months
Just a quick note on training FLOPs w/ exact attention: scaling context doesn't mean a quadratic increase in FLOPs per dataset. For the GPU-rich, going from 4K -> 10M context on a 175B model incurs 150x FLOPs per dataset. For the GPU-poor, we can use 8 GPUs to expand context by 8x at 2x cost.
Tweet media one
1
3
38
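A back-of-the-envelope check of that claim, using the common rough accounting of ~6N parameter FLOPs plus ~12·L·d·s attention FLOPs per training token; the 175B-model constants below (96 layers, d = 12288) are assumptions, so treat the result as ballpark only.

```python
# training FLOPs per token ~= 6*N (parameters) + 12*L*d*s (exact attention)
N, L, d = 175e9, 96, 12288          # assumed GPT-3-scale constants

def flops_per_token(s):
    return 6 * N + 12 * L * d * s

ratio = flops_per_token(10_000_000) / flops_per_token(4096)
print(f"{ratio:.0f}x")              # ~130x with these constants
```

With this accounting the increase per dataset is roughly 130x, in the same ballpark as the ~150x quoted above (the exact factor depends on how FLOPs are counted), and far below the ~2400x one might fear from the attention term alone, because parameter FLOPs dominate at 4K context.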
@haoliuhl
Hao Liu
1 year
This is a joint work with amazing collaborators @carlo_sferrazza and @pabbeel . Check out the paper and code for more details. The code supports fairly large-scale training/finetuning too. Paper: Code:
1
3
38
@haoliuhl
Hao Liu
2 years
Some thoughts on the work led by my amazing collaborators @denisyarats and @brandfonbrener . With diverse data, many problems in RL just go away. Bitter lesson strikes again. ExORL could be very useful for future offline and unsupervised RL research.
@denisyarats
Denis Yarats
2 years
Currently, Offline RL data is collected under the same reward that is used for evaluation, not ideal... @brandfonbrener and I propose an alternative approach – ExORL, that uses Unsupervised RL & relabeling to construct datasets for Offline RL. paper: 1/10
Tweet media one
4
31
155
2
12
37
@haoliuhl
Hao Liu
1 year
This open source project wouldn't be possible without the diligent efforts from @younggeng . We’d welcome any feedback and contributions!
3
2
34
@haoliuhl
Hao Liu
8 months
RingAttention sets new records. It can handle sequences device count times longer than previous bests: >16M context with 30B model on TPUv4-512 (512 times longer), >16M for 13B on 32x A100 (32 times longer), and >2M for 13B on 8x A100 (8 times longer).
Tweet media one
1
2
34
@haoliuhl
Hao Liu
8 months
As we compute attention, each host sends key-value blocks to the next host while receiving key-value blocks from the preceding host. If the block size is larger than a threshold, the communication of key-value blocks is fully overlapped by the computation of attention and feedforward.
Tweet media one
1
0
33
@haoliuhl
Hao Liu
8 months
With RingAttention, each device's memory requirement is linear in the block size instead of the entire sequence length. Here, b is batch size, h is hidden dimension, n is the number of heads, s is sequence length, c is block size.
Tweet media one
1
1
33
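In big-O terms (using the variable names above and deliberately dropping constants, since the paper's exact expression is in the figure), the per-device activation memory goes roughly from

```latex
\underbrace{O\!\left(b\,n\,s^{2} + b\,s\,h\right)}_{\text{standard attention}}
\;\longrightarrow\;
\underbrace{O\!\left(b\,n\,c^{2} + b\,c\,h\right)}_{\text{RingAttention, per device}}
```

i.e. it scales with the block size c rather than the full sequence length s, which is what lets the context grow with the number of devices.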
@haoliuhl
Hao Liu
1 year
We expect that the performance of OpenLLaMA, after completing its training on 1 trillion tokens, will be enhanced even further. The current release is only a preview of what the complete OpenLLaMA release will offer.
1
1
31
@haoliuhl
Hao Liu
3 months
We curated a very large dataset of diverse long videos and texts and proposed a two-stage training recipe to enable a large-context world model over video and language. We expand Llama2's context progressively from 4K to 1M on language and video to manage compute cost.
Tweet media one
1
2
31
@haoliuhl
Hao Liu
1 year
@pabbeel 11/ Our Blockwise Parallel Transformer allows for 32x longer context lengths than the vanilla Transformer and 4x longer than the memory-efficient Transformer. It enables training sequences of 65K for a 30B model on 8 GPUs and for a 3B model on 1 GPU.
Tweet media one
2
4
30
@haoliuhl
Hao Liu
1 year
@pabbeel 14/ We are excited to see what's next: what new capabilities will emerge from being able to train longer context large Transformers? Check out the paper for more details, full code will be released soon. Paper:
1
4
23
@haoliuhl
Hao Liu
3 months
Current approaches to modeling the world are mostly restricted to short text or image sequences. This limits their understanding of the parts of the world that are hard to represent in text or short clips, and leaves them unable to handle complex long-form language and visual tasks.
1
0
27
@haoliuhl
Hao Liu
3 months
By expanding context on long-form books and model-generated QA data, LWM achieves near-perfect accuracy on the popular needle retrieval task, outperforming GPT-4 and Gemini Pro.
Tweet media one
Tweet media two
Tweet media three
1
3
25
@haoliuhl
Hao Liu
8 months
RingAttention requires only a ring topology, which is very minimal and supported on GPUs and TPUs. The minimal block size is determined by FLOPs / unidirectional bandwidth and can easily be met by using efficient blockwise attention and FFN on each device.
Tweet media one
1
1
24
@haoliuhl
Hao Liu
3 months
LWM opens up exciting possibilities for developing more capable AI systems that understand both textual knowledge and the multimodal world and can solve a wide range of problems. Paper: Code: Website:
4
1
24
@haoliuhl
Hao Liu
7 months
Upon close inspection, the RLHF datasets' conversations (e.g., the HH dataset) have substantially worse quality and diversity than ShareGPT. Since Koala already captures a good distribution via SFT on ShareGPT, further RLHF/CoH/SFT on the HH dataset deteriorates model performance.
2
0
23
@haoliuhl
Hao Liu
1 year
@pabbeel 2/ Our method, the Blockwise Parallel Transformer, leverages blockwise computation of self-attention and fused feedforward to minimize memory costs. We use the same model architecture as the original Transformer, but with a different way of organizing the compute.
1
1
21
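A minimal JAX sketch of that reorganization (illustrative, not the BPT repository): attention is computed one query block at a time, and the feedforward is applied to each block's output immediately, so full-sequence FFN activations are never materialized. Output projection, layer norms, causal masking, and the online-softmax inner loop are omitted, and all names and shapes are assumptions.

```python
import jax
import jax.numpy as jnp

def bpt_layer(x, wq, wk, wv, w1, w2, block_len=512):
    """x: [seq_len, d]; seq_len assumed divisible by block_len."""
    q, k, v = x @ wq, x @ wk, x @ wv
    k_blocks = k.reshape(-1, block_len, k.shape[-1])
    v_blocks = v.reshape(-1, block_len, v.shape[-1])

    def per_query_block(q_blk, x_blk):
        # attention for this query block over all key/value blocks
        s = jnp.einsum("qd,nkd->qnk", q_blk, k_blocks) / jnp.sqrt(q_blk.shape[-1])
        p = jax.nn.softmax(s.reshape(s.shape[0], -1), axis=-1)
        o = p @ v_blocks.reshape(-1, v_blocks.shape[-1])
        h = x_blk + o                          # attention residual
        return h + jax.nn.gelu(h @ w1) @ w2    # blockwise (fused) feedforward

    q_blocks = q.reshape(-1, block_len, q.shape[-1])
    x_blocks = x.reshape(-1, block_len, x.shape[-1])
    out = jax.lax.map(lambda args: per_query_block(*args), (q_blocks, x_blocks))
    return out.reshape(x.shape)
```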
@haoliuhl
Hao Liu
2 years
Thanks @ak92501 for tweeting so fast!
@_akhaliq
AK
2 years
Multimodal Masked Autoencoders Learn Transferable Representations abs: the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction
Tweet media one
1
26
173
0
3
20
@haoliuhl
Hao Liu
1 year
Interestingly, while our model-generated text outputs are not interpretable to humans, LLMs can successfully do few-shot learning from them. This suggests that the text tokens generated by LQAE and BERT-like models contain patterns that can be captured and leveraged by a powerful LLM.
Tweet media one
2
4
20
@haoliuhl
Hao Liu
3 months
After training the 1M context language model on a large dataset of diverse visual and language sequences with masked sequence packing and RingAttention, LWM can perform language, image, and video understanding and generation. LWM can do text-to-image generation.
Tweet media one
Tweet media two
1
0
20
@haoliuhl
Hao Liu
1 year
How can we learn from all feedback without RL? Our idea: humans learn from rich feedback in the form of language. Given that LLMs are already powerful, why not turn all feedback into a sentence and train the model to follow it? We propose chain-of-hindsight (CoH):
Tweet media one
1
1
19
@haoliuhl
Hao Liu
1 year
@pabbeel 3/ This enables a 32x longer context window than vanilla Transformers and up to 4x longer than memory-efficient Transformers.
Tweet media one
1
1
18
@haoliuhl
Hao Liu
5 months
Heading to @NeurIPSConf now. I'm thinking about generalization of language models and interactive agents. Please say hi if you're into these too. I will be presenting BPT & LQAE posters, plus EAI and RingAttention in workshops.
1
0
17
@haoliuhl
Hao Liu
1 year
@pabbeel 13/ We applied BPT to learn trial-and-error across trajectories, leveraging the Agentic Transformer and Algorithm Distillation. Our long-context model consistently surpasses the original Transformer model across all six tasks.
Tweet media one
1
1
17
@haoliuhl
Hao Liu
3 months
LWM can do video generation from text prompts. See website for more video examples:
Tweet media one
Tweet media two
1
0
17
@haoliuhl
Hao Liu
3 months
LWM can answer questions about YouTube videos over 1 hour long, while GPT-4V and Gemini Pro Vision struggle.
Tweet media one
Tweet media two
Tweet media three
3
1
17
@haoliuhl
Hao Liu
1 year
@pabbeel 4/ Transformer's self-attention & position-wise feedforward efficiently capture long-range dependencies, enabling scalability in context length & model size through parallel computations.
1
0
15
@haoliuhl
Hao Liu
1 year
@pabbeel 12/ In terms of speed, using high-level Jax operations, BPT enables high-throughput training that matches or surpasses the speed of vanilla and memory efficient Transformers. Porting our method to low-level kernels in CUDA or Triton will achieve maximum speedup.
Tweet media one
1
0
16
@haoliuhl
Hao Liu
3 months
We propose model-generated QA to address limited long-text data at this stage: We split documents into fixed chunks for our short-context model to generate QA pairs per chunk. Then, we construct long-context examples by merging adjacent chunks and appending QA pairs to the end.
1
0
16
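A hedged sketch of that pipeline in plain Python; generate_qa stands in for a call to the short-context model and is hypothetical, and the chunk and merge sizes are placeholders rather than the values used for LWM.

```python
from typing import Callable, List

def build_long_context_examples(document: str,
                                generate_qa: Callable[[str], List[str]],
                                chunk_chars: int = 4000,
                                chunks_per_example: int = 64) -> List[str]:
    # 1) split the document into fixed-size chunks
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    examples = []
    # 2) merge adjacent chunks and append their QA pairs at the end, so
    #    answering requires attending far back into the long context
    for start in range(0, len(chunks), chunks_per_example):
        group = chunks[start:start + chunks_per_example]
        qa_pairs = [qa for chunk in group for qa in generate_qa(chunk)]
        examples.append("".join(group) + "\n\n" + "\n".join(qa_pairs))
    return examples
```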
@haoliuhl
Hao Liu
1 year
@pabbeel 9/ In the diagram, we explain our idea: for the first incoming input block at the bottom, we project it into a query; then we iterate over the same input sequence positioned above the bottom row and project it into keys and values.
Tweet media one
1
0
16
@haoliuhl
Hao Liu
1 year
The techniques behind ChatGPT are supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). SFT is simple and scalable but cannot use negative feedback. RLHF uses all feedback but is very complex and very hard to tune. Can we go beyond SFT and RLHF?
Tweet media one
1
1
16
@haoliuhl
Hao Liu
3 months
We found that vision-language training needs to mix images, videos, and pure text together. Without pure text (e.g. the openllama v2 mix), the model overfits to vision; without images, video generation has low visual quality, since videos often have lower visual quality than images.
1
1
16
@haoliuhl
Hao Liu
3 months
In this work, we propose LWM to model complex million-length language and visual sequences. We curated a large, diverse dataset and utilized RingAttention to scalably train on it. We discover challenges and propose masked sequence packing and model-generated QA to address them.
1
1
16
@haoliuhl
Hao Liu
3 months
We propose masked sequence packing such that each image-text pair only attends to tokens within the pair. Mixing images, texts, and video with standard sequence packing, widely used in current approaches to language model training, leads to very suboptimal model performance.
1
0
16
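Concretely, this amounts to building a block-diagonal attention mask from per-token segment ids; a minimal sketch (illustrative, not the LWM implementation), combined here with the usual causal mask:

```python
import jax.numpy as jnp

def packing_mask(segment_ids):
    """segment_ids: [seq_len] ints, one id per packed (image, text) pair."""
    same_pair = segment_ids[:, None] == segment_ids[None, :]
    causal = jnp.tril(jnp.ones((segment_ids.shape[0],) * 2, dtype=bool))
    return same_pair & causal                  # [seq_len, seq_len] bool mask

# usage: scores = jnp.where(packing_mask(seg_ids), scores, -1e9) before softmax
```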
@haoliuhl
Hao Liu
4 years
💡 We took inspiration from Ng ( @AndrewYNg ) & Jordan 2002, which showed that classifiers trained with a generative loss can outperform classifiers trained with a discriminative loss. Our work can be seen as lifting it into today’s context of training deep NNs. [3/N]
2
2
15
@haoliuhl
Hao Liu
1 year
@pabbeel 8/ To overcome this challenge, we observed that merging the computation of feedforward and attention block by block eliminates the need to perform the feedforward step on the entire sequence, which significantly cuts memory cost.
1
0
15
@haoliuhl
Hao Liu
1 year
To conclude: CoH is a simple framework for aligning language models with feedback. The idea is turning all feedback into a sequence to train models. The real world offers many different forms of feedback, which present interesting opportunities for learning in the future.
Tweet media one
1
1
14
@haoliuhl
Hao Liu
3 months
This work provides a highly optimized, open-source implementation with RingAttention, masked sequence packing, model-generated QA, and other key features for million-token-length vision-language training. We have good MFUs even at very large context sizes.
Tweet media one
Tweet media two
1
1
15
@haoliuhl
Hao Liu
1 year
@pabbeel 6/ Rabe et al. and FlashAttention (Dao et al.) introduced memory-efficient attention techniques that utilize the well-established online softmax to compute self-attention block by block, allowing exact self-attention to be computed with linear memory complexity.
1
0
15
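The merge rule behind that online softmax, as a compact sketch (per query row, m is the running max, num the value-weighted numerator, den the softmax denominator; the helper names are mine):

```python
import jax.numpy as jnp

def block_stats(q, k_blk, v_blk):
    """Partial softmax statistics of one key/value block for query rows q."""
    s = q @ k_blk.T / jnp.sqrt(q.shape[-1])
    m = s.max(-1, keepdims=True)
    p = jnp.exp(s - m)
    return m, p @ v_blk, p.sum(-1, keepdims=True)   # max, numerator, denominator

def merge(a, b):
    """Exactly combine the statistics of two blocks."""
    (m_a, num_a, den_a), (m_b, num_b, den_b) = a, b
    m = jnp.maximum(m_a, m_b)
    num = num_a * jnp.exp(m_a - m) + num_b * jnp.exp(m_b - m)
    den = den_a * jnp.exp(m_a - m) + den_b * jnp.exp(m_b - m)
    return m, num, den

# folding merge over all key/value blocks and returning num / den reproduces
# exact softmax attention while only ever storing block-sized intermediates
```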
@haoliuhl
Hao Liu
1 year
@pabbeel 15/ This was a fun project. We thank the members of the Berkeley Robot Learning Lab and Berkeley AI Lab as well as Anselm Levskaya, Markus Rabe, Federico Lebron, and Sharad Vikram at Google for their insightful discussions and suggestions.
2
0
15
@haoliuhl
Hao Liu
1 year
This baseline, improved SFT, should outperform SFT due to more human-labeled data. But how much does RLHF still outperform improved SFT? Having an answer would help us understand and improve results. It would be great if this baseline could be included in the future. [2/2]
0
0
15
@haoliuhl
Hao Liu
3 months
LWM can answer questions about images.
Tweet media one
1
2
14
@haoliuhl
Hao Liu
1 year
@pabbeel 5/ But quadratic self-attention & large feedforward network require a large amount of memory, challenging scalability for longer input sequences.
1
0
14
@haoliuhl
Hao Liu
2 years
The Jax implementation has been released. Some additional features added:
- Predicting discretized image tokens from VQGAN as output (similar to BEiT).
- Training on a combination of paired image-text data (e.g. CC12M) and unpaired text data (e.g. Wikipedia).
@haoliuhl
Hao Liu
2 years
Excited to share M3AE, a simple but effective model for multimodal representation learning. TLDR: M3AE learns a unified encoder for both vision and language from both paired image-text data as well as unpaired data. w/ @YoungGeng Summary thread: [1/N]
Tweet media one
5
38
249
0
1
14
@haoliuhl
Hao Liu
1 year
LLMs are in-context and multi-task learners after unsupervised learning on broad data. But how can they learn from the ubiquitous feedback in the real world? ChatGPT and InstructGPT show amazing results by learning from human feedback.
Tweet media one
1
1
14
@haoliuhl
Hao Liu
3 months
Starting with the 1M context language model, we train on mixed inputs: images, videos, and texts in diverse orderings (text-image, image-text, video-text, text-video, etc.) using autoregressive prediction. Essentially, this is any-to-any prediction across multiple modalities.
Tweet media one
1
0
14
@haoliuhl
Hao Liu
1 year
@pabbeel 10/ These queries, keys, and values are used to compute self-attention (yellow box), whose output is passed to the feedforward network (cyan box), followed by a residual connection.
Tweet media one
1
0
13
@haoliuhl
Hao Liu
1 year
@pabbeel 7/ Despite reduced memory needs in self-attention, a challenge remains with the large parameter count and high-dimensional vectors of the feedforward network. This becomes the primary memory issue when using memory-efficient attention.
1
0
13
@haoliuhl
Hao Liu
1 year
Better summarization. CoH outperforms SFT and RLHF on the summarization benchmark. CoH achieves higher scores (left fig) and generates summaries that are significantly more preferred in human evaluation (right table) than those from SFT and RLHF.
Tweet media one
1
0
13
@haoliuhl
Hao Liu
8 months
@HlibIvanov @matei_zaharia @pabbeel Stay tuned! We are interested in training / finetuning large context LLM/VLM with RingAttention.
2
1
13
@haoliuhl
Hao Liu
1 year
CoH just consists of a likelihood function and is simple to implement. It comes with several advantages:
1. A more natural type of feedback
2. A more natural form of training procedure
3. More effective experimental results
CoH outperforms RLHF and SFT on a wide range of tasks.
1
0
12
@haoliuhl
Hao Liu
1 year
Better dialogue. CoH outperforms SFT and RLHF on the dialogue benchmark from the AnthropicAI human preference dataset. CoH achieves higher accuracy at classifying which dialogue is preferred (left fig) and is substantially more preferred by humans (right table) than SFT and RLHF.
Tweet media one
2
0
12
@haoliuhl
Hao Liu
1 year
At inference time, CoH uses positive feedback to guide the model to generate the desired outputs, such as "generate a good and informative summary". Since CoH has seen different comparisons during training, it can follow follow-up instructions such as "generate a better summary".
Tweet media one
1
1
11
@haoliuhl
Hao Liu
2 years
Our method, Forgetful Causal Masking (FCM), combines masked language modeling (MLM) and causal language modeling (CLM) by masking out randomly selected past tokens layer-wise using the attention mask.
Tweet media one
1
2
10
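A minimal sketch of what such a mask could look like (an assumption on my part, not the paper's code): start from the causal mask, randomly drop a fraction of past tokens, and always let a token attend to itself. The 15% ratio is a placeholder, and sampling a fresh key per layer would give the layer-wise behavior mentioned above.

```python
import jax
import jax.numpy as jnp

def fcm_mask(key, seq_len, mask_ratio=0.15):
    causal = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))
    keep = jax.random.bernoulli(key, 1.0 - mask_ratio, (seq_len,))  # per past token
    keep_self = jnp.eye(seq_len, dtype=bool)
    return causal & (keep[None, :] | keep_self)

# scores = jnp.where(fcm_mask(key, T), scores, -1e9) before the softmax
```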
@haoliuhl
Hao Liu
1 year
Better controllable generation. CoH is better at following multi-round instructions than the second-best method, RLHF, for instance "Generate a good summary", then "Generate a shorter and more precise summary".
Tweet media one
1
0
10
@haoliuhl
Hao Liu
2 years
Properties of FCM:
1. No extra compute cost
2. Simple to implement, and it works
3. Scales well to larger models
Applying FCM to PaLM trained on C4 improves zero-shot SuperGLUE performance from 55.7% to 59.2% (1B model) and from 61.6% to 64.0% (8B model).
Tweet media one
1
0
8
@haoliuhl
Hao Liu
4 years
🌈 Concurrently, @jimwinkens et al., @sangwoomo & @BunelR et al. showed that contrastive learning (e.g. SimCLR) improves OOD detection of classifiers! However, HDGE's contrastive loss term doesn't rely on data augmentation. [9/N]
2
0
8
@haoliuhl
Hao Liu
2 years
Thanks for the attention. Check out the paper for more details. We are excited to apply this technique to improve large language models and beyond. All comments and feedback are welcome.
1
0
8
@haoliuhl
Hao Liu
2 years
Our largest 8B model matches the score of PaLM with an average score of 64%, despite the fact that PaLM is trained on a much larger dataset (780B tokens) of high-quality conversation and webpage data, while ours is trained on the smaller C4 dataset (180B tokens).
Tweet media one
2
0
8
@haoliuhl
Hao Liu
7 months
Links to the great work mentioned above: Alpaca (), Vicuna (), AlpacaEval (), MTBench ().
0
0
8
@haoliuhl
Hao Liu
1 year
The idea is to encode images as sequences of text tokens by directly quantizing image embeddings using a pretrained BERT codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from the BERT-predicted text token embeddings.
Tweet media one
1
1
8
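The quantization step boils down to a nearest-neighbor lookup into a frozen BERT word-embedding table; a hedged sketch (the encoder/decoder and the straight-through gradient are omitted, and shapes and names are assumptions):

```python
import jax.numpy as jnp

def quantize_to_bert_codebook(patch_embeds, bert_embeddings):
    """patch_embeds: [num_patches, d]; bert_embeddings (frozen): [vocab, d]."""
    # squared L2 distance from every patch embedding to every codebook entry
    d2 = (jnp.sum(patch_embeds ** 2, -1, keepdims=True)
          - 2.0 * patch_embeds @ bert_embeddings.T
          + jnp.sum(bert_embeddings ** 2, -1))
    token_ids = jnp.argmin(d2, axis=-1)           # the image as "text" tokens
    return token_ids, bert_embeddings[token_ids]  # ids + quantized embeddings
```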
@haoliuhl
Hao Liu
1 year
At training time, CoH randomly samples one or more model outputs and uses them to form a sentence consisting of both positive and negative feedback in the form of a comparison, such as "The following is a bad summary" and "The following summary is better".
Tweet media one
1
1
8
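A hedged sketch of that construction; the templates are taken from the wording above, everything else (sampling, scores, helper names) is illustrative, and the loss-masking details are omitted.

```python
import random

def coh_example(prompt, worse, better):
    return (f"{prompt}\n"
            f"The following is a bad summary: {worse}\n"
            f"The following summary is better: {better}")

def sample_coh_example(prompt, rated_outputs):
    """rated_outputs: list of (output_text, preference_score) pairs."""
    a, b = random.sample(rated_outputs, 2)
    worse, better = (a[0], b[0]) if a[1] < b[1] else (b[0], a[0])
    return coh_example(prompt, worse, better)
```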
@haoliuhl
Hao Liu
7 months
Just like prior AI breakthroughs that were built on open, high-quality datasets (thanks ImageNet/Atari/Wikipedia), to advance research on language models we probably also need crowd-sourced human feedback datasets built with open-source models.
1
1
7
@haoliuhl
Hao Liu
7 months
In Feb, we proposed CoH, an SFT-based alternative to RL-based RLHF. We were excited to see that such straightforward conditional training appears to outperform RL-based RLHF on public human feedback datasets such as Anthropic's HH dataset.
@haoliuhl
Hao Liu
1 year
Humans learn from rich feedback in the form of language. Why not turn all feedback into a sentence to train models? We propose CoH: just tell models which outputs are not good and which are better. Better than SFT and RLHF on summarization and dialogue tasks.
Tweet media one
13
120
637
1
0
7
@haoliuhl
Hao Liu
1 year
A higher masking ratio than normal is necessary for good downstream performance: standard language denoisers such as BERT commonly use a masking ratio of 15%, whereas LQAE performance is highest at around a 50% masking ratio.
Tweet media one
2
0
7
@haoliuhl
Hao Liu
2 years
📝 Due to its flexibility and scalability, M3AE is especially suitable for learning from extremely large-scale datasets, and we envision that such pre-trained models can be broadly applicable in many practical downstream tasks, such as visual reasoning and RL. [10/N]
1
0
7
@haoliuhl
Hao Liu
10 months
Blockwise Parallel Transformer (BPT) reorganizes the transformer's computation to reduce its memory cost to linear w/o modifying the architecture. Jax code for training long-context LLaMA using BPT.
0
1
6
@haoliuhl
Hao Liu
2 years
Thanks to the free MLM training, FCM not only improves zero-shot learning, it also improves finetuning performance, from 67.0% to 68.7% (1B model) and from 81.0% to 82.5% (8B model).
Tweet media one
1
0
6
@haoliuhl
Hao Liu
1 year
Language models are trained on text corpora and fundamentally lack visual perception -- a crucial capability needed to extend these models to interact with the real world and solve vision tasks, such as visual question answering and robotics.
Tweet media one
1
0
6