When developing DistAttn, we discovered a better gradient checkpointing strategy for models that use FlashAttention (FA). FA already rematerializes the attention computation inside its backward kernel: it recomputes the attention scores from Q, K, and V rather than storing the full attention matrix. Placing the FA call inside a checkpointed region therefore recomputes a forward pass whose large intermediates were never stored in the first place, so the checkpoint boundaries should be placed so that the FA call itself is excluded from recomputation. Notably, this is not specific to DistAttn; it applies to any model that uses FA.