Hao Liu Profile
Hao Liu

@haoliuhl

4,204
Followers
156
Following
98
Media
267
Statuses

phd student @berkeley_ai machine learning, neural networks.

Joined September 2018
Pinned Tweet
@haoliuhl
Hao Liu
3 months
We are excited to share Large World Model (LWM), a general-purpose 1M context multimodal autoregressive model. It is trained on a large dataset of diverse long videos and books using RingAttention, and can perform language, image, and video understanding and generation.
Tweet media one
Tweet media two
Tweet media three
25
263
1K
@haoliuhl
Hao Liu
1 year
As part of our effort to replicate LLaMA in an open-source manner, we are pleased to announce the release of a preview of the 7B OpenLLaMA model, which has been trained with 200 billion tokens on the RedPajama dataset.
32
403
2K
@haoliuhl
Hao Liu
8 months
New paper w/ @matei_zaharia @pabbeel on transformers with large context size. We propose RingAttention, which allows training sequences that are device-count times longer than the prior state of the art, without attention approximations or incurring additional overhead.
Tweet media one
Tweet media two
10
182
854
@haoliuhl
Hao Liu
1 year
Humans learn from rich feedback in the form of language. Why not turn all feedback into a sentence to train models? We propose CoH: just tell models which outputs are not good and which are better. Better than SFT and RLHF on summarization and dialogue tasks.
Tweet media one
13
120
637
@haoliuhl
Hao Liu
1 year
1/ Excited to share our new paper with @pabbeel on long context models! 📚✍️ Check it out here: Training 7B models with over 130K or 13B models with over 64K context windows on just 8 A100 GPUs! 😮🖥️ Curious how we did it?
Tweet media one
Tweet media two
Tweet media three
8
107
596
@haoliuhl
Hao Liu
4 years
Excited to share our new work that explores the relationship between contrastive learning, discriminative modeling & generative modeling, through the lens of energy-based models. 🎓 💻 w/ @pabbeel summary thread: [1/N]
Tweet media one
5
88
409
@haoliuhl
Hao Liu
1 year
We introduce an unsupervised method to align text and image. Language Quantized AutoEncoders (LQAE) enables few-shot image classification with GPT3 and linear classification of images based on RoBERTa text features. paper: code:
Tweet media one
4
82
396
@haoliuhl
Hao Liu
2 years
Can language model pretraining be even better? Our paper shows that by randomly masking input tokens during pretraining, the zero-shot, few-shot, and fine-tuning performance can be significantly improved. 🧵
Tweet media one
2
41
302
@haoliuhl
Hao Liu
2 years
Excited to share M3AE, a simple but effective model for multimodal representation learning. TLDR: M3AE learns a unified encoder for both vision and language from both paired image-text data as well as unpaired data. w/ @YoungGeng Summary thread: [1/N]
Tweet media one
5
38
249
@haoliuhl
Hao Liu
3 years
A new preprint “Behavior From the Void: Unsupervised Active Pre-Training”. w/ @pabbeel TLDR: A simple yet effective method for reward-free unsupervised pre-training in RL via particle-based entropy maximization. Here is a summary thread👇
Tweet media one
2
52
231
@haoliuhl
Hao Liu
7 months
RingAttention's Jax code is available at . In end-to-end FSDP training on GPU (7B params, 8x A100 80G), context expands from 32K to 256K tokens and can reach 16M tokens with 512x A100. On TPU (7B params, 1024x TPUv4, FSDP), context can reach 8M tokens.
@haoliuhl
Hao Liu
8 months
New paper w/ @matei_zaharia @pabbeel on transformers with large context size. We propose RingAttention, which allows training sequences that are device-count times longer than the prior state of the art, without attention approximations or incurring additional overhead.
Tweet media one
Tweet media two
10
182
854
3
39
182
@haoliuhl
Hao Liu
1 year
With ChatGPT's mind-blowing results, the ML community is getting more curious about RLHF. RLHF outperforms supervised finetuning (SFT), as shown in InstructGPT. But RLHF uses an extra large dataset in step 2. Thus, a missing baseline is SFT on both datasets from steps 1 and 2. [1/2]
Tweet media one
3
19
176
@haoliuhl
Hao Liu
1 year
The code of blockwise parallel transformer is now available.
@haoliuhl
Hao Liu
1 year
1/ Excited to share our new paper with @pabbeel on long context models! 📚✍️ Check it out here: Training 7B models with over 130K or 13B models with over 64K context windows on just 8 A100 GPUs! 😮🖥️ Curious how we did it?
Tweet media one
Tweet media two
Tweet media three
8
107
596
1
21
110
@haoliuhl
Hao Liu
8 months
The possibility of very large context introduces exciting opportunities, such as video-audio-language model, learning from extended feedback or trial-and-error, and AI for science data like gene sequence. Paper link: Code link: coming soon
5
6
100
@haoliuhl
Hao Liu
1 year
How should we pretrain large language-vision models to help with seeing, acting, and following instructions? We found that models jointly pretrained on image-text pairs and a text-only corpus significantly outperform baselines. A 🧵 on the paper InstructRL
3
13
88
@haoliuhl
Hao Liu
10 months
At #ICML2023 to present Agentic Transformer and Blockwise Parallel Transformer, and will be hanging out at poster sessions. Pls reach out if you'd like to chat about ML etc.
Tweet media one
2
13
87
@haoliuhl
Hao Liu
1 year
In our #NeurIPS2022 work, we explore the generality of masked token prediction for generalizable and flexible reinforcement learning. A 🧵 on the paper
Tweet media one
3
21
83
@haoliuhl
Hao Liu
7 months
To echo Yann on the importance of crowd-sourced human feedback datasets: the absence of such high-quality datasets appears to have become a research bottleneck. A thread:
@ylecun
Yann LeCun
7 months
Human feedback for open source LLMs needs to be crowd-sourced, Wikipedia style. It is the only way for LLMs to become the repository of all human knowledge and cultures. Who wants to build the platform for this?
212
353
2K
1
9
81
@haoliuhl
Hao Liu
8 months
RingAttention lets you scale context length linearly with device count, breaking free from memory constraints. If you can train at 4K length on 8 GPUs, with RingAttention you can train at least 32K length with nearly zero overhead.
Tweet media one
1
8
77
@haoliuhl
Hao Liu
3 months
We open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens, along with the codebase for training and inference. The models are available at .
1
9
62
@haoliuhl
Hao Liu
1 year
Does interactive learning help developing better perception than learning from static datasets? In our #NeurIPS2022 paper, we propose a method based on unsupervised RL that matches SOTA SSL methods, without using data augmentation. A 🧵 on the paper:
Tweet media one
2
10
56
@haoliuhl
Hao Liu
8 months
We applied RingAttention to finetune a 512K context chatbot on conversations. On the long-range line retrieval task, GPT3.5-turbo-16K and Claude-2-100K demonstrate competitive accuracy within short context lengths. However, they cannot handle extended context lengths.
Tweet media one
1
3
46
@haoliuhl
Hao Liu
8 months
We use the original Transformer's architecture but rearrange the computation. In a ring of devices, each device stores one query block, while key-value blocks rotate through the devices for computing attention and feedforward.
Tweet media one
1
2
45
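To make the rotation above concrete, here is a minimal JAX sketch of the idea (illustrative only, not the released RingAttention code): each device keeps its query block, key/value blocks travel around the ring via jax.lax.ppermute, and attention is accumulated blockwise with an online softmax. Causal masking and the communication/compute overlap are omitted, and the function and shape names are assumptions.

```python
import jax
import jax.numpy as jnp

N_DEV = jax.device_count()                               # size of the ring
PERM = [(i, (i + 1) % N_DEV) for i in range(N_DEV)]      # send to the next host

def ring_attention(q, k, v):
    """q, k, v: this device's blocks, shape [block_len, head_dim]."""
    def step(carry, _):
        num, den, m, k_blk, v_blk = carry
        s = q @ k_blk.T / jnp.sqrt(q.shape[-1])           # block scores
        m_new = jnp.maximum(m, s.max(-1, keepdims=True))  # running max
        scale = jnp.exp(m - m_new)                        # rescale old stats
        p = jnp.exp(s - m_new)
        num = num * scale + p @ v_blk                     # value-weighted numerator
        den = den * scale + p.sum(-1, keepdims=True)      # softmax denominator
        # pass our key/value block to the next host, receive from the previous
        k_blk = jax.lax.ppermute(k_blk, "ring", PERM)
        v_blk = jax.lax.ppermute(v_blk, "ring", PERM)
        return (num, den, m_new, k_blk, v_blk), None

    init = (jnp.zeros_like(q),
            jnp.zeros(q.shape[:-1] + (1,)),
            jnp.full(q.shape[:-1] + (1,), -jnp.inf),
            k, v)
    (num, den, *_), _ = jax.lax.scan(step, init, None, length=N_DEV)
    return num / den

# one query/key/value block per device
attn = jax.pmap(ring_attention, axis_name="ring")
```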
@haoliuhl
Hao Liu
1 year
We evaluated OpenLLaMA using lm-evaluation-harness from @AiEleuther . Compared with the original LLaMA (1T tokens) and GPT-J (500B tokens), OpenLLaMA (200B tokens) exhibits comparable performance across a majority of tasks, and outperforms them on some.
1
3
44
@haoliuhl
Hao Liu
1 year
We are currently focused on completing the training process on the entire RedPajama dataset. This should give us an apples-to-apples comparison between the original LLaMA and our OpenLLaMA. Please stay tuned for when this will be available!
2
1
44
@haoliuhl
Hao Liu
1 year
Motivated by examining how LLaMA's data curation contributes to its exceptional performance and by creating a fully open-source version of LLaMA, we decided to replicate LLaMA with the same training hyperparameters and model configuration as the original.
1
3
40
@haoliuhl
Hao Liu
1 year
We train OpenLLaMA on cloud TPU-v4 pod using data parallelism and FSDP/Zero3 to balance throughput and memory usage. Overall we reach a throughput of over 1900 tokens / second / TPU-v4 chip in our training run. The training loss can be seen in the figure below.
Tweet media one
2
4
39
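For readers curious what "data parallelism and FSDP/Zero3" looks like in JAX terms, here is a hedged sketch using jax.sharding; the mesh shape, array sizes, and train_step are illustrative assumptions, not the actual OpenLLaMA training setup.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# assume an even device count: one data-parallel axis times one FSDP axis
devices = np.array(jax.devices()).reshape(2, -1)
mesh = Mesh(devices, axis_names=("dp", "fsdp"))

# Zero3-style: shard parameters over 'fsdp', replicate over 'dp';
# shard the batch over both axes
param_sharding = NamedSharding(mesh, P("fsdp"))
batch_sharding = NamedSharding(mesh, P(("dp", "fsdp")))

params = jax.device_put(jnp.zeros((8192, 4096)), param_sharding)
batch = jax.device_put(jnp.zeros((32, 2048), dtype=jnp.int32), batch_sharding)

@jax.jit
def train_step(params, batch):
    # toy forward pass (embedding lookup); XLA's GSPMD partitioner inserts the
    # all-gathers / reduce-scatters implied by the shardings. The real model,
    # loss, and optimizer step are omitted.
    return jnp.mean(params[batch])
```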
@haoliuhl
Hao Liu
1 year
We train OpenLLaMA on the RedPajama dataset curated by @togethercompute , which is an open reproduction of the LLaMA dataset containing 1.2 trillion tokens, roughly matching the number of tokens used by LLaMA. You can find more details in Together's blog .
1
4
38
@haoliuhl
Hao Liu
1 year
Many thanks to @_akhaliq for sharing our arxiv paper :)
@_akhaliq
AK
1 year
Blockwise Parallel Transformer for Long Context Large Models present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences
Tweet media one
0
20
103
2
2
38
@haoliuhl
Hao Liu
8 months
Just a quick note on training FLOPs w/ exact attention: scaling context doesn't mean a quadratic increase in FLOPs per dataset. For the GPU-rich, going from 4K -> 10M context on a 175B model incurs 150x FLOPs per dataset. For the GPU-poor, we can use 8 GPUs to expand context by 8x at 2x cost.
Tweet media one
1
3
38
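A back-of-the-envelope check of that claim, using the common rough accounting of ~6N parameter FLOPs plus ~12·L·d·s attention FLOPs per training token; the 175B-model constants below (96 layers, d = 12288) are assumptions, so treat the result as ballpark only.

```python
# training FLOPs per token ~= 6*N (parameters) + 12*L*d*s (exact attention)
N, L, d = 175e9, 96, 12288          # assumed GPT-3-scale constants

def flops_per_token(s):
    return 6 * N + 12 * L * d * s

ratio = flops_per_token(10_000_000) / flops_per_token(4096)
print(f"{ratio:.0f}x")              # ~130x with these constants
```

With this accounting the increase per dataset is roughly 130x, in the same ballpark as the ~150x quoted above (the exact factor depends on how FLOPs are counted), and far below the ~2400x one might fear from the attention term alone, because parameter FLOPs dominate at 4K context.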
@haoliuhl
Hao Liu
1 year
This is a joint work with amazing collaborators @carlo_sferrazza and @pabbeel . Check out the paper and code for more details. The code supports fairly large-scale training/finetuning too. Paper: Code:
1
3
38
@haoliuhl
Hao Liu
2 years
Some thoughts on the work led by my amazing collaborators @denisyarats and @brandfonbrener . With diverse data, many problems in RL just go away. Bitter lesson strikes again. ExORL could be very useful for future offline and unsupervised RL research.
@denisyarats
Denis Yarats
2 years
Currently, Offline RL data is collected under the same reward that is used for evaluation, not ideal... @brandfonbrener and I propose an alternative approach – ExORL, that uses Unsupervised RL & relabeling to construct datasets for Offline RL. paper: 1/10
Tweet media one
4
31
155
2
12
37
@haoliuhl
Hao Liu
1 year
This open source project wouldn't be possible without the diligent efforts from @younggeng . We’d welcome any feedback and contributions!
3
2
34
@haoliuhl
Hao Liu
8 months
RingAttention sets new records. It can handle sequences device count times longer than previous bests: >16M context with 30B model on TPUv4-512 (512 times longer), >16M for 13B on 32x A100 (32 times longer), and >2M for 13B on 8x A100 (8 times longer).
Tweet media one
1
2
34
@haoliuhl
Hao Liu
8 months
As we compute attention, each host sends key-value blocks to the next host while receiving key-value blocks from the preceding host. If the block size is larger than a threshold, the communication of key-value blocks is fully overlapped by the computation of attention and feedforward.
Tweet media one
1
0
33
@haoliuhl
Hao Liu
8 months
With RingAttention, each device's memory requirement is linear in the block size instead of the entire sequence length. Here, b is batch size, h is hidden dimension, n is the number of heads, s is sequence length, c is block size.
Tweet media one
1
1
33
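In big-O terms (using the variable names above and deliberately dropping constants, since the paper's exact expression is in the figure), the per-device activation memory goes roughly from

```latex
\underbrace{O\!\left(b\,n\,s^{2} + b\,s\,h\right)}_{\text{standard attention}}
\;\longrightarrow\;
\underbrace{O\!\left(b\,n\,c^{2} + b\,c\,h\right)}_{\text{RingAttention, per device}}
```

i.e. it scales with the block size c rather than the full sequence length s, which is what lets the context grow with the number of devices.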
@haoliuhl
Hao Liu
1 year
We expect that the performance of OpenLLaMA, after completing its training on 1 trillion tokens, will be enhanced even further. The current release is only a preview of what the complete OpenLLaMA release will offer.
1
1
31
@haoliuhl
Hao Liu
3 months
We curated a very large dataset of diverse long videos and texts and proposed a two-stage training recipe to enable a large-context world model over video and language. We expand Llama2's context progressively from 4K to 1M on language and video to manage compute cost.
Tweet media one
1
2
31
@haoliuhl
Hao Liu
1 year
@pabbeel 11/ Our Blockwise Parallel Transformer allows for 32x longer context lengths than the vanilla Transformer and 4x longer than the memory-efficient Transformer. It enables training sequences of 65K for a 30B model on 8 GPUs and for a 3B model on 1 GPU.
Tweet media one
2
4
30
@haoliuhl
Hao Liu
1 year
@pabbeel 14/ We are excited to see what's next: what new capabilities will emerge from being able to train longer context large Transformers? Check out the paper for more details, full code will be released soon. Paper:
1
4
23
@haoliuhl
Hao Liu
3 months
Current approaches to modeling the world are mostly restricted to short text or image sequences. This limits their understanding of the parts of the world that are hard to represent in text or short clips, and leaves them unable to handle complex long-form language and visual tasks.
1
0
27
@haoliuhl
Hao Liu
3 months
By expanding context on long-form books and model-generated QA data, LWM achieves near-perfect accuracy on the popular needle retrieval task, outperforming GPT-4 and Gemini Pro.
Tweet media one
Tweet media two
Tweet media three
1
3
25
@haoliuhl
Hao Liu
8 months
RingAttention requires only a ring topology, which is very minimal and supported on GPUs and TPUs. The minimal block size is determined by FLOPs / unidirectional bandwidth and can easily be met by using efficient blockwise attention and FFN on each device.
Tweet media one
1
1
24
@haoliuhl
Hao Liu
3 months
LWM opens up exciting possibilities for developing more capable AI systems that understand both textual knowledge and the multimodal world and can solve a wide range of problems. Paper: Code: Website:
4
1
24
@haoliuhl
Hao Liu
7 months
Upon close inspection, the RLHF datasets' conversations (e.g., the HH dataset) have substantially worse quality and diversity than ShareGPT. Since Koala already captures a good distribution via SFT on ShareGPT, further RLHF/CoH/SFT on the HH dataset deteriorates model performance.
2
0
23
@haoliuhl
Hao Liu
1 year
@pabbeel 2/ Our method, the Blockwise Parallel Transformer, leverages blockwise computation of self-attention and fused feedforward to minimize memory costs. We use the same model architecture as the original Transformer, but with a different way of organizing the compute.
1
1
21
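A minimal JAX sketch of that reorganization (illustrative, not the BPT repository): attention is computed one query block at a time, and the feedforward is applied to each block's output immediately, so full-sequence FFN activations are never materialized. Output projection, layer norms, causal masking, and the online-softmax inner loop are omitted, and all names and shapes are assumptions.

```python
import jax
import jax.numpy as jnp

def bpt_layer(x, wq, wk, wv, w1, w2, block_len=512):
    """x: [seq_len, d]; seq_len assumed divisible by block_len."""
    q, k, v = x @ wq, x @ wk, x @ wv
    k_blocks = k.reshape(-1, block_len, k.shape[-1])
    v_blocks = v.reshape(-1, block_len, v.shape[-1])

    def per_query_block(q_blk, x_blk):
        # attention for this query block over all key/value blocks
        s = jnp.einsum("qd,nkd->qnk", q_blk, k_blocks) / jnp.sqrt(q_blk.shape[-1])
        p = jax.nn.softmax(s.reshape(s.shape[0], -1), axis=-1)
        o = p @ v_blocks.reshape(-1, v_blocks.shape[-1])
        h = x_blk + o                          # attention residual
        return h + jax.nn.gelu(h @ w1) @ w2    # blockwise (fused) feedforward

    q_blocks = q.reshape(-1, block_len, q.shape[-1])
    x_blocks = x.reshape(-1, block_len, x.shape[-1])
    out = jax.lax.map(lambda args: per_query_block(*args), (q_blocks, x_blocks))
    return out.reshape(x.shape)
```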
@haoliuhl
Hao Liu
2 years
Thanks @ak92501 for tweeting so fast!
@_akhaliq
AK
2 years
Multimodal Masked Autoencoders Learn Transferable Representations abs: the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction
Tweet media one
1
26
173
0
3
20
@haoliuhl
Hao Liu
1 year
Interestingly, while our model-generated text outputs are not interpretable to humans, LLMs can successfully do few-shot learning from them. This suggests that the text tokens generated by LQAE and BERT-like models contain patterns that can be captured and leveraged by a powerful LLM.
Tweet media one
2
4
20
@haoliuhl
Hao Liu
3 months
After training the 1M context language model on a large dataset of diverse visual and language sequences with masked sequence packing and RingAttention, LWM can perform language, image, and video understanding and generation. LWM can do text-to-image generation.
Tweet media one
Tweet media two
1
0
20
@haoliuhl
Hao Liu
1 year
How can we learn from all feedback without RL? Our idea: humans learn from rich feedback in the form of language. Given that LLMs are already powerful, why not turn all feedback into a sentence and train the model to follow it? We propose chain-of-hindsight (CoH):
Tweet media one
1
1
19
@haoliuhl
Hao Liu
1 year
@pabbeel 3/ This enables a 32x longer context window than vanilla Transformers and up to 4x longer than memory-efficient Transformers.
Tweet media one
1
1
18
@haoliuhl
Hao Liu
5 months
Heading to @NeurIPSConf now. I'm thinking about generalization of language models and interactive agents. Please say hi if you're into these too. I will be presenting BPT & LQAE posters, plus EAI and RingAttention in workshops.
1
0
17
@haoliuhl
Hao Liu
1 year
@pabbeel 13/ We applied BPT to learn trial-and-error across trajectories, leveraging the Agentic Transformer and Algorithm Distillation. Our long-context model consistently surpasses the original Transformer model across all six tasks.
Tweet media one
1
1
17
@haoliuhl
Hao Liu
3 months
LWM can do video generation from text prompts. See website for more video examples:
Tweet media one
Tweet media two
1
0
17
@haoliuhl
Hao Liu
3 months
LWM can answer questions about YouTube videos over 1 hour long, while GPT-4V and Gemini Pro Vision struggle.
Tweet media one
Tweet media two
Tweet media three
3
1
17
@haoliuhl
Hao Liu
1 year
@pabbeel 4/ Transformer's self-attention & position-wise feedforward efficiently capture long-range dependencies, enabling scalability in context length & model size through parallel computations.
1
0
15
@haoliuhl
Hao Liu
1 year
@pabbeel 12/ In terms of speed, using high-level Jax operations, BPT enables high-throughput training that matches or surpasses the speed of vanilla and memory efficient Transformers. Porting our method to low-level kernels in CUDA or Triton will achieve maximum speedup.
Tweet media one
1
0
16
@haoliuhl
Hao Liu
3 months
We propose model-generated QA to address limited long-text data at this stage: We split documents into fixed chunks for our short-context model to generate QA pairs per chunk. Then, we construct long-context examples by merging adjacent chunks and appending QA pairs to the end.
1
0
16
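A hedged sketch of that pipeline in plain Python; generate_qa stands in for a call to the short-context model and is hypothetical, and the chunk and merge sizes are placeholders rather than the values used for LWM.

```python
from typing import Callable, List

def build_long_context_examples(document: str,
                                generate_qa: Callable[[str], List[str]],
                                chunk_chars: int = 4000,
                                chunks_per_example: int = 64) -> List[str]:
    # 1) split the document into fixed-size chunks
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    examples = []
    # 2) merge adjacent chunks and append their QA pairs at the end, so
    #    answering requires attending far back into the long context
    for start in range(0, len(chunks), chunks_per_example):
        group = chunks[start:start + chunks_per_example]
        qa_pairs = [qa for chunk in group for qa in generate_qa(chunk)]
        examples.append("".join(group) + "\n\n" + "\n".join(qa_pairs))
    return examples
```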
@haoliuhl
Hao Liu
1 year
@pabbeel 9/ In the diagram, we explain our idea: for the first incoming input block at the bottom, we project it into a query; then we iterate over the same input sequence positioned above the bottom row and project it into keys and values.
Tweet media one
1
0
16
@haoliuhl
Hao Liu
1 year
The techniques behind ChatGPT are supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). SFT is simple and scalable but cannot use negative feedback. RLHF uses all feedback but is very complex and very hard to tune. Can we go beyond SFT and RLHF?
Tweet media one
1
1
16
@haoliuhl
Hao Liu
3 months
We found that vision-language training needs to mix images, videos, and pure text together. Without pure text (e.g. the openllama v2 mix), the model overfits to vision; without images, video generation has low visual quality, since videos often have lower visual quality than images.
1
1
16
@haoliuhl
Hao Liu
3 months
In this work, we propose LWM to model complex million-length language and visual sequences. We curated a large, diverse dataset and utilized RingAttention to scalably train on it. We discover challenges and propose masked sequence packing and model-generated QA to address them.
1
1
16
@haoliuhl
Hao Liu
3 months
We propose masked sequence packing such that each image-text pair only attends to tokens within the pair. Mixing images, texts, and video with standard sequence packing, widely used in current approaches to language model training, leads to very suboptimal model performance.
1
0
16
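Concretely, this amounts to building a block-diagonal attention mask from per-token segment ids; a minimal sketch (illustrative, not the LWM implementation), combined here with the usual causal mask:

```python
import jax.numpy as jnp

def packing_mask(segment_ids):
    """segment_ids: [seq_len] ints, one id per packed (image, text) pair."""
    same_pair = segment_ids[:, None] == segment_ids[None, :]
    causal = jnp.tril(jnp.ones((segment_ids.shape[0],) * 2, dtype=bool))
    return same_pair & causal                  # [seq_len, seq_len] bool mask

# usage: scores = jnp.where(packing_mask(seg_ids), scores, -1e9) before softmax
```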
@haoliuhl
Hao Liu
4 years
💡 We took inspiration from Ng ( @AndrewYNg ) & Jordan 2002, which showed that classifiers trained with a generative loss can outperform classifiers trained with a discriminative loss. Our work can be seen as lifting it into today’s context of training deep NNs. [3/N]
2
2
15
@haoliuhl
Hao Liu
1 year
@pabbeel 8/ To overcome this challenge, we observed that merging the computation of feedforward and attention block by block eliminates the need to perform the feedforward step on the entire sequence, which significantly cuts memory cost.
1
0
15
@haoliuhl
Hao Liu
1 year
To conclude: CoH is a simple framework for aligning language models with feedback. The idea is turning all feedback into a sequence to train models. The real world offers many different forms of feedback, which present interesting opportunities for learning in the future.
Tweet media one
1
1
14
@haoliuhl
Hao Liu
3 months
This work provides a highly optimized, open-source implementation with RingAttention, masked sequence packing, model-generated QA, and other key features for million-token-length vision-language training. We have good MFUs even at very large context sizes.
Tweet media one
Tweet media two
1
1
15
@haoliuhl
Hao Liu
1 year
@pabbeel 6/ Rabe et al. and FlashAttention (Dao et al.) introduced memory-efficient attention techniques that utilize the well-established online softmax to compute self-attention block by block, allowing exact self-attention to be computed with linear memory complexity.
1
0
15
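The merge rule behind that online softmax, as a compact sketch (per query row, m is the running max, num the value-weighted numerator, den the softmax denominator; the helper names are mine):

```python
import jax.numpy as jnp

def block_stats(q, k_blk, v_blk):
    """Partial softmax statistics of one key/value block for query rows q."""
    s = q @ k_blk.T / jnp.sqrt(q.shape[-1])
    m = s.max(-1, keepdims=True)
    p = jnp.exp(s - m)
    return m, p @ v_blk, p.sum(-1, keepdims=True)   # max, numerator, denominator

def merge(a, b):
    """Exactly combine the statistics of two blocks."""
    (m_a, num_a, den_a), (m_b, num_b, den_b) = a, b
    m = jnp.maximum(m_a, m_b)
    num = num_a * jnp.exp(m_a - m) + num_b * jnp.exp(m_b - m)
    den = den_a * jnp.exp(m_a - m) + den_b * jnp.exp(m_b - m)
    return m, num, den

# folding merge over all key/value blocks and returning num / den reproduces
# exact softmax attention while only ever storing block-sized intermediates
```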
@haoliuhl
Hao Liu
1 year
@pabbeel 15/ This was a fun project. We thank the members of the Berkeley Robot Learning Lab and Berkeley AI Lab as well as Anselm Levskaya, Markus Rabe, Federico Lebron, and Sharad Vikram at Google for their insightful discussions and suggestions.
2
0
15
@haoliuhl
Hao Liu
1 year
This baseline, improved SFT, should outperform SFT due to more human-labeled data. But how much does RLHF still outperform improved SFT? Having an answer would help us understand and improve results. It would be great if this baseline could be included in the future. [2/2]
0
0
15
@haoliuhl
Hao Liu
3 months
LWM can answer questions about images.
Tweet media one
1
2
14
@haoliuhl
Hao Liu
1 year
@pabbeel 5/ But quadratic self-attention & large feedforward network require a large amount of memory, challenging scalability for longer input sequences.
1
0
14
@haoliuhl
Hao Liu
2 years
The Jax implementation has been released. Some additional features added:
- Predicting discretized image tokens from VQGAN as output (similar to BEiT).
- Training on a combination of paired image-text data (e.g. CC12M) and unpaired text data (e.g. Wikipedia).
@haoliuhl
Hao Liu
2 years
Excited to share M3AE, a simple but effective model for multimodal representation learning. TLDR: M3AE learns a unified encoder for both vision and language from both paired image-text data as well as unpaired data. w/ @YoungGeng Summary thread: [1/N]
Tweet media one
5
38
249
0
1
14
@haoliuhl
Hao Liu
1 year
LLMs are in-context and multi-task learners after unsupervised learning on broad data. But how can they learn from the ubiquitous feedback in the real world? ChatGPT and InstructGPT show amazing results by learning from human feedback.
Tweet media one
1
1
14
@haoliuhl
Hao Liu
3 months
Starting with the 1M context language model, we train on mixed inputs: images, videos, and texts in diverse orderings (text-image, image-text, video-text, text-video, etc.) using autoregressive prediction. Essentially, this is any-to-any prediction across multiple modalities.
Tweet media one
1
0
14
@haoliuhl
Hao Liu
1 year
@pabbeel 10/ These queries, keys, and values are used to compute self-attention (yellow box), whose output is passed to the feedforward network (cyan box), followed by a residual connection.
Tweet media one
1
0
13
@haoliuhl
Hao Liu
1 year
@pabbeel 7/ Despite reduced memory needs in self-attention, a challenge remains with the large parameter count and high-dimensional vectors of the feedforward network. This becomes the primary memory issue when using memory-efficient attention.
1
0
13
@haoliuhl
Hao Liu
1 year
Better summarization. CoH outperforms SFT and RLHF on the summarization benchmark. CoH achieves higher scores (left fig) and generates summaries that are significantly more preferred in human evaluation (right table) than those from SFT and RLHF.
Tweet media one
1
0
13
@haoliuhl
Hao Liu
8 months
@HlibIvanov @matei_zaharia @pabbeel Stay tuned! We are interested in training / finetuning large context LLM/VLM with RingAttention.
2
1
13
@haoliuhl
Hao Liu
1 year
CoH just consists of a likelihood function and is simple to implement. It comes with several advantages:
1. A more natural type of feedback
2. A more natural form of training procedure
3. More effective experimental results
CoH outperforms RLHF and SFT on a wide range of tasks.
1
0
12
@haoliuhl
Hao Liu
1 year
Better dialogue. CoH outperforms SFT and RLHF on the dialogue benchmark from the AnthropicAI human preference dataset. CoH achieves higher accuracy at classifying which dialogue is preferred (left fig) and is substantially more preferred by humans (right table) than SFT and RLHF.
Tweet media one
2
0
12
@haoliuhl
Hao Liu
1 year
At inference time, CoH uses positive feedback to guide the model to generate the desired outputs, such as "generate a good and informative summary". Since CoH has seen different comparisons during training, it can follow follow-up instructions such as "generate a better summary".
Tweet media one
1
1
11
@haoliuhl
Hao Liu
2 years
Our method, Forgetful Causal Masking (FCM), combines masked language modeling (MLM) and causal language modeling (CLM) by masking out randomly selected past tokens layer-wise using the attention mask.
Tweet media one
1
2
10
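A minimal sketch of what such a mask could look like (an assumption on my part, not the paper's code): start from the causal mask, randomly drop a fraction of past tokens, and always let a token attend to itself. The 15% ratio is a placeholder, and sampling a fresh key per layer would give the layer-wise behavior mentioned above.

```python
import jax
import jax.numpy as jnp

def fcm_mask(key, seq_len, mask_ratio=0.15):
    causal = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))
    keep = jax.random.bernoulli(key, 1.0 - mask_ratio, (seq_len,))  # per past token
    keep_self = jnp.eye(seq_len, dtype=bool)
    return causal & (keep[None, :] | keep_self)

# scores = jnp.where(fcm_mask(key, T), scores, -1e9) before the softmax
```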
@haoliuhl
Hao Liu
1 year
Better controllable generation. CoH is better at following multi-round instructions than the second-best method, RLHF, for instance "Generate a good summary", then "Generate a shorter and more precise summary".
Tweet media one
1
0
10
@haoliuhl
Hao Liu
2 years
Properties of FCM:
1. No extra compute cost
2. Simple to implement, and it works
3. Scales well to larger models
Applying FCM to PaLM trained on C4 improves zero-shot SuperGLUE performance from 55.7% to 59.2% (1B model) and from 61.6% to 64.0% (8B model).
Tweet media one
1
0
8
@haoliuhl
Hao Liu
4 years
🌈 Concurrently, @jimwinkens et al., @sangwoomo & @BunelR et al. showed that contrastive learning (e.g. SimCLR) improves OOD detection of classifiers! However, HDGE's contrastive loss term doesn't rely on data augmentation. [9/N]
2
0
8
@haoliuhl
Hao Liu
2 years
Thanks for the attention. Check out the paper for more details. We are excited to apply this technique to improve large language models and beyond. All comments and feedback are welcome.
1
0
8
@haoliuhl
Hao Liu
2 years
Our largest 8B model matches the score of PaLM with an average score of 64%, despite the fact that PaLM is trained on a much larger dataset (780B tokens) of high-quality conversation and webpage data, while ours is trained on the smaller C4 dataset (180B tokens).
Tweet media one
2
0
8
@haoliuhl
Hao Liu
7 months
Links to the great work mentioned above: Alpaca (), Vicuna (), AlpacaEval (), MTBench ().
0
0
8
@haoliuhl
Hao Liu
1 year
The idea is to encode images as sequences of text tokens by directly quantizing image embeddings using a pretrained BERT codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from the BERT-predicted text token embeddings.
Tweet media one
1
1
8
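The quantization step boils down to a nearest-neighbor lookup into a frozen BERT word-embedding table; a hedged sketch (the encoder/decoder and the straight-through gradient are omitted, and shapes and names are assumptions):

```python
import jax.numpy as jnp

def quantize_to_bert_codebook(patch_embeds, bert_embeddings):
    """patch_embeds: [num_patches, d]; bert_embeddings (frozen): [vocab, d]."""
    # squared L2 distance from every patch embedding to every codebook entry
    d2 = (jnp.sum(patch_embeds ** 2, -1, keepdims=True)
          - 2.0 * patch_embeds @ bert_embeddings.T
          + jnp.sum(bert_embeddings ** 2, -1))
    token_ids = jnp.argmin(d2, axis=-1)           # the image as "text" tokens
    return token_ids, bert_embeddings[token_ids]  # ids + quantized embeddings
```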
@haoliuhl
Hao Liu
1 year
At training time, CoH randomly samples one or more model outputs and uses them to form a sentence consisting of both positive and negative feedback in the form of a comparison, such as "The following is a bad summary" and "The following summary is better".
Tweet media one
1
1
8
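A hedged sketch of that construction; the templates are taken from the wording above, everything else (sampling, scores, helper names) is illustrative, and the loss-masking details are omitted.

```python
import random

def coh_example(prompt, worse, better):
    return (f"{prompt}\n"
            f"The following is a bad summary: {worse}\n"
            f"The following summary is better: {better}")

def sample_coh_example(prompt, rated_outputs):
    """rated_outputs: list of (output_text, preference_score) pairs."""
    a, b = random.sample(rated_outputs, 2)
    worse, better = (a[0], b[0]) if a[1] < b[1] else (b[0], a[0])
    return coh_example(prompt, worse, better)
```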
@haoliuhl
Hao Liu
7 months
Just like prior AI breakthroughs that were built on open, high-quality datasets (thanks ImageNet/Atari/Wikipedia), to advance research on language models we probably also need crowd-sourced human feedback datasets built with open-source models.
1
1
7
@haoliuhl
Hao Liu
7 months
In Feb, we proposed CoH, an SFT-based alternative to RL-based RLHF. We were excited to see that such straightforward conditional training appears to outperform RL-based RLHF on public human feedback datasets such as Anthropic's HH dataset.
@haoliuhl
Hao Liu
1 year
Humans learn from rich feedback in the form of language. Why not turn all feedback into a sentence to train models? We propose CoH: just tell models which outputs are not good and which are better. Better than SFT and RLHF on summarization and dialogue tasks.
Tweet media one
13
120
637
1
0
7
@haoliuhl
Hao Liu
1 year
A higher masking ratio than normal is necessary for good downstream performance: standard language denoisers such as BERT commonly use a masking ratio of 15%, whereas LQAE performance is highest at around a 50% masking ratio.
Tweet media one
2
0
7
@haoliuhl
Hao Liu
2 years
📝 Due to its flexibility and scalability, M3AE is especially suitable for learning from extremely large-scale datasets, and we envision that such pre-trained models can be broadly applicable in many practical downstream tasks, such as visual reasoning and RL. [10/N]
1
0
7
@haoliuhl
Hao Liu
10 months
Blockwise Parallel Transformer (BPT) reorganizes the transformer's computation to reduce its memory cost to linear w/o modifying the architecture. Jax code for training long-context LLaMA using BPT.
0
1
6
@haoliuhl
Hao Liu
2 years
Thanks to the free MLM training, FCM not only improves zero-shot learning, it also improves finetuning performance, from 67.0% to 68.7% (1B model) and from 81.0% to 82.5% (8B model).
Tweet media one
1
0
6
@haoliuhl
Hao Liu
1 year
Language models are trained on text corpora and fundamentally lack visual perception -- a crucial capability needed to extend these models to interact with the real world and solve vision tasks, such as visual question answering and robotics.
Tweet media one
1
0
6