Rulin Shao

@RulinShao

649 Followers · 410 Following · 7 Media · 55 Statuses

PhD @UWNLP | MS @SCSatCMU | ex-Applied Scientist @AWS

Joined April 2022
Pinned Tweet
@RulinShao
Rulin Shao
8 months
Introducing LightSeq for long-context LLM training:
- Highly optimized for decoder models
- Smarter checkpointing
- Better support for models with fewer heads
Up to 2x faster and 2-8x longer sequences vs. Megatron-LM.
7
93
378
@RulinShao
Rulin Shao
1 year
Introducing LongChat and LongEval. Check our new models and benchmark for long context chatbots!
@lmsysorg
lmsys.org
1 year
🔥Introducing LongChat🤖, our new chatbots supporting 16K tokens context, and LongEval, our new benchmark for testing long context chatbots. 🤥Surprisingly, we found open LLMs often fail to achieve their promised context length. Check our blog for details:
Tweet media one
4
106
473
0
1
21
@RulinShao
Rulin Shao
11 months
Definitely a huge need here: many people have asked whether our MPCFormer (ICLR'23) would let them use commercial models like ChatGPT in their business without revealing any confidential data to the server 🤔 It can, but we still have a long way to go in terms of generation speed...
@_akhaliq
AK
11 months
PUMA: Secure Inference of LLaMA-7B in Five Minutes paper page: With ChatGPT as a representative, tons of companies have begun to provide services based on large Transformer models. However, using such a service inevitably leaks users' prompts to the
Tweet media one
5
31
176
0
1
17
@RulinShao
Rulin Shao
8 months
When developing DistAttn, we discovered a better grad checkpointing strategy in the presence of FlashAttention (FA). This is because FA does rematerialization inside its backward kernel, which makes recomputation redundant. More interestingly, this applies to any case that uses FA (rough sketch below).
Tweet media one
1
0
13
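A minimal PyTorch sketch of that checkpointing idea, not LightSeq's actual implementation: the block layout, the `mlp` argument, and the `attn_fn` callable (standing in for a FlashAttention kernel) are assumptions made for illustration. The checkpoint boundary is placed right after attention, so the recomputation triggered during backward never re-runs the attention that FA's backward kernel already rematerializes internally.

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class BlockWithFACheckpointing(nn.Module):
    """Illustrative transformer block (hypothetical, for this sketch only)."""

    def __init__(self, dim, mlp):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.mlp = mlp  # any feed-forward sub-module

    def forward(self, x, attn_fn):
        # Naive checkpointing would wrap this whole block, re-running attn_fn
        # during backward even though FlashAttention's backward kernel already
        # rematerializes the attention internally.
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        attn_out = self.proj(attn_fn(q, k, v))  # e.g. a FlashAttention kernel
        x = x + attn_out                        # attention output is saved
        # Only the non-attention part is wrapped in activation checkpointing.
        x = x + checkpoint(lambda h: self.mlp(self.norm2(h)), x,
                           use_reentrant=False)
        return x
```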
@RulinShao
Rulin Shao
6 months
Check out this awesome release! Everything is open-sourced: the model, training code, and data!
@llm360
LLM360
6 months
🚀 1/7 We are thrilled to launch LLM360 — pushing the frontier of open-source & transparent LLMs! Starting with Amber (7B) & CrystalCoder (7B), we are releasing brand new pre-trained LLMs with all training code, data, and up to 360 model checkpoints. 🔗
19
191
1K
0
1
12
@RulinShao
Rulin Shao
10 months
Looking for a benchmark for vision-language instruction-following? 👇
@_akhaliq
AK
10 months
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use paper page: introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of instruction-following vision-language models for real-world use.
Tweet media one
1
73
289
0
0
11
@RulinShao
Rulin Shao
8 months
LightSeq features distributed attention (DistAttn). It splits the input sequence into chunks and assigns the computation for each chunk to one GPU. All modules except attention are embarrassingly parallel; DistAttn communicates keys and values to complete the attention computation (simplified sketch below).
Tweet media one
1
0
10
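A rough sequence-parallel attention sketch along those lines, assuming each rank holds one contiguous chunk of shape (chunk_len, num_heads, head_dim); it uses a plain all-gather of K/V for readability instead of LightSeq's peer-to-peer schedule, and omits the causal mask and the fused attention kernel.

```python
import torch
import torch.distributed as dist

def dist_attn_chunk(q_local, k_local, v_local):
    """Queries stay local; keys and values are exchanged so the local chunk
    can attend over the full sequence. Sketch only: no causal mask, no fused
    kernel, no streaming softmax."""
    world = dist.get_world_size()
    k_all = [torch.empty_like(k_local) for _ in range(world)]
    v_all = [torch.empty_like(v_local) for _ in range(world)]
    dist.all_gather(k_all, k_local)  # communicate keys...
    dist.all_gather(v_all, v_local)  # ...and values
    k = torch.cat(k_all, dim=0)
    v = torch.cat(v_all, dim=0)

    scale = q_local.shape[-1] ** -0.5
    scores = torch.einsum("qhd,khd->hqk", q_local, k) * scale
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)
```

In practice the gather would be replaced by the chunked peer-to-peer exchange described in the following tweets, with the per-chunk attention computed by a fused kernel.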
@RulinShao
Rulin Shao
10 months
✨ Check it out: LongChat now supports 32k context length based on Llama-2
@DachengLi177
Dacheng Li
10 months
Along with Vicuna-v1.5, we also released LongChat-v1.5, based on Llama-2 and 32k context length. You can try it in FastChat or evaluate it in the LongChat repo!
1
18
79
0
0
10
@RulinShao
Rulin Shao
8 months
Our design also enables overlapping communication with computation. Experimental results show that a substantial portion of the communication can be hidden in LightSeq (sketch of the overlap below).
Tweet media one
1
0
6
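One way to sketch that overlap, assuming a ring-style exchange of a packed K/V tensor; `attn_fn` and the output accumulation are placeholders, and the softmax rescaling across chunks is omitted, so this is not LightSeq's actual scheduler.

```python
import torch
import torch.distributed as dist

def ring_step(q, kv, out_acc, attn_fn):
    """Launch the asynchronous K/V exchange for the next step first, then run
    attention on the chunk already in hand, so the transfer is hidden behind
    compute."""
    rank, world = dist.get_rank(), dist.get_world_size()
    kv_next = torch.empty_like(kv)
    ops = [dist.P2POp(dist.isend, kv, (rank + 1) % world),
           dist.P2POp(dist.irecv, kv_next, (rank - 1) % world)]
    reqs = dist.batch_isend_irecv(ops)  # communication starts in the background
    out_acc += attn_fn(q, kv)           # ...while computation proceeds
    for req in reqs:
        req.wait()                      # sync before using the received chunk
    return kv_next
```

Issuing the exchange before the compute is what allows the transfer to proceed in the background while the GPU is busy with the attention math.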
@RulinShao
Rulin Shao
8 months
We note that the workload in LLMs is imbalanced: later tokens have a longer context to attend to, which causes computation bubbles in sequence parallelism. To fix this, we designed a load-balancing algorithm that lets the idle (bubble) workers help the busy ones (toy illustration below).
Tweet media one
1
0
6
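A toy illustration of that imbalance and of rebalancing it greedily, assuming unit cost per attended chunk; this is not LightSeq's scheduler, just a way to see how much spare capacity the early (idle) ranks have under causal attention.

```python
def causal_costs(num_ranks):
    # With causal attention, rank i's chunk attends to chunks 0..i,
    # so its workload grows roughly linearly with its position.
    return [i + 1 for i in range(num_ranks)]

def rebalance(costs):
    """Greedy toy pass: repeatedly move one unit of attention work from the
    busiest rank to the idlest one until the spread is at most one unit."""
    work, moves = list(costs), []
    while max(work) - min(work) > 1:
        src, dst = work.index(max(work)), work.index(min(work))
        work[src] -= 1
        work[dst] += 1
        moves.append((src, dst))
    return work, moves

print(causal_costs(8))                # [1, 2, 3, 4, 5, 6, 7, 8] -> early ranks idle
print(rebalance(causal_costs(8))[0])  # [4, 4, 4, 4, 5, 5, 5, 5] -> balanced
```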
@RulinShao
Rulin Shao
8 months
Our experiments show that LightSeq trains faster and better supports LLMs with varying numbers of attention heads than Megatron-LM, on both intra-node training with NVLink and inter-node training with InfiniBand.
Tweet media one
1
0
5
@RulinShao
Rulin Shao
8 months
LightSeq uses sequence parallelism alone and thus imposes no assumptions on the model architecture, such as the number of heads. Therefore, LightSeq can scale beyond the number of heads and easily handle models with only a few attention heads (small numeric contrast below).
Tweet media one
1
0
4
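A small numeric contrast, with assumed example numbers (4 heads, 8 GPUs, 32k tokens), between head-wise tensor parallelism and the sequence-wise split used here; the variable names are made up for the illustration.

```python
num_heads, world_size, seq_len = 4, 8, 32_768  # assumed example numbers

# Head-wise (tensor) parallelism shards attention by head, so each rank must
# own a whole number of heads; 4 heads cannot be spread over 8 GPUs.
heads_per_rank = num_heads / world_size   # 0.5 -> not a valid sharding

# Sequence parallelism splits along the sequence axis instead, so the head
# count never enters the picture.
tokens_per_rank = seq_len // world_size   # 4096 tokens per GPU

print(heads_per_rank, tokens_per_rank)
```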
@RulinShao
Rulin Shao
1 year
Check out our work on vision-language compositional reasoning at ACL 2023!
@khoomeik
Rohan Pandey (e/acc)
1 year
I'll be at #ACL2023 in a few weeks to present CACR, a self-supervised objective that encourages vision-language relation alignment and improves performance on compositionality benchmarks.
Tweet media one
1
0
7
0
1
4
@RulinShao
Rulin Shao
6 months
@AkariAsai Congrats!
0
0
3
@RulinShao
Rulin Shao
6 months
@IanMagnusson Congrats on the release! Great work😍
1
0
2