Piotr Nawrot Profile
Piotr Nawrot

@p_nawrot

3,045 Followers
225 Following
20 Media
260 Statuses

PhD student in #NLProc @Edin_CDT_NLP | Previously intern @Nvidia & @MetaAI

Warsaw
Joined July 2014
Pinned Tweet
@p_nawrot
Piotr Nawrot
2 months
The memory in Transformers grows linearly with the sequence length at inference time. In SSMs it is constant, but often at the expense of performance. We introduce Dynamic Memory Compression (DMC) where we retrofit LLMs to compress their KV cache while preserving performance…
Tweet media one
7
70
395
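A back-of-the-envelope sketch of the linear growth mentioned in the tweet above. The layer/head/dimension numbers match the published Llama-2 7B configuration; the batch size and sequence lengths are illustrative assumptions, not figures from the DMC paper.

```python
# KV-cache size for a Llama-2-7B-like model in fp16.
# Model dimensions are the published ones; batch size and sequence lengths
# are illustrative assumptions.
n_layers, n_kv_heads, d_head = 32, 32, 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(batch, seq_len):
    # 2x accounts for storing both keys and values
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_value

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(batch=1, seq_len=seq_len) / 2**30
    print(f"seq_len={seq_len:>7}: ~{gib:.0f} GiB of KV cache")
# The cache grows linearly with seq_len; compressing it n-fold (the goal of DMC)
# divides these numbers by n.
```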
@p_nawrot
Piotr Nawrot
10 months
🎇Introducing *nanoT5 v2*🎇 Inspired by @karpathy 's #nanoGPT , we improve the repo for pre-training a T5 model in PyTorch. In ~16 hours on a single GPU, we achieve 40.7 RougeL on the SNI benchmark, compared to 40.9 RougeL of the original model pre-trained on 150x more data!
Tweet media one
7
83
476
@p_nawrot
Piotr Nawrot
1 year
Introducing *nanoT5* Inspired by @jonasgeiping 's Cramming and @karpathy 's nanoGPT, we fill the gap of a repository for pre-training T5-style "LLMs" under a limited budget (1xA100 GPU, ~20 hours) in PyTorch 🧑‍💻 @EdinburghNLP
Tweet media one
8
81
460
@p_nawrot
Piotr Nawrot
1 year
Great news! “Efficient Transformers with Dynamic Token Pooling” has been accepted to #ACL23 ! We increase the efficiency *and* performance of Transformer LMs by jointly segmenting and modelling language. @PontiEdoardo @AdrianLancucki @JChorowski 📜
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
1
26
146
@p_nawrot
Piotr Nawrot
2 months
I'll jump on the hype train and quote Andrej too, since I have been tweeting about tokenisers for quite some time now. I believe that the Dynamic Token Pooling Transformers () we've authored are a good example of a tokenisation-free network. Specifically, we…
@karpathy
Andrej Karpathy
3 months
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer" Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and…
Tweet media one
383
2K
14K
5
17
137
@p_nawrot
Piotr Nawrot
7 months
nanoT5 got accepted to the NLP Open-Source Software Workshop at #EMNLP2023 🎇 You can access the report about the repo here: More work on efficient methods for LLMs coming soon! 👀 See you in Singapore!
@p_nawrot
Piotr Nawrot
10 months
🎇Introducing *nanoT5 v2*🎇 Inspired by @karpathy 's #nanoGPT , we improve the repo for pre-training a T5 model in PyTorch. In ~16 hours on a single GPU, we achieve 40.7 RougeL on the SNI benchmark, compared to 40.9 RougeL of the original model pre-trained on 150x more data!
Tweet media one
7
83
476
1
16
96
@p_nawrot
Piotr Nawrot
8 months
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models () accepted to @NeurIPSConf ! I'm very proud of this work : ) Big congrats to @jeankaddour , @oscar__key , @PMinervini , and Matt J. Kusner!
@jeankaddour
Jean Kaddour
10 months
📢The costs for training (L)LMs skyrocketed 🚀 in recent years, motivating efficient training algorithms. However, when pre-training BERT and T5 models with a fixed compute budget, we find their gains vanish compared to a baseline with a fully-decayed learning rate! 1/5
Tweet media one
2
28
129
5
7
79
@p_nawrot
Piotr Nawrot
9 days
Two free medium-compute Mixture-of-Experts research ideas: Prerequisite: Mixtral 8x7B has 32 layers; at each layer there are 8 experts, and each token is assigned to 2 experts at a given layer. 1) Dynamic Expert Assignment in MoE Models Every token is assigned to 2*32=64 experts in…
5
12
74
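A minimal sketch of the routing arithmetic in the idea above: Mixtral-style top-2 routing over 8 experts, repeated across 32 layers, gives 2 × 32 = 64 expert assignments per token. This is an illustrative toy router, not Mixtral's actual implementation.

```python
# Toy top-2 MoE routing: 8 experts per layer, 2 chosen per token, 32 layers.
import torch
import torch.nn.functional as F

n_layers, n_experts, top_k, d_model = 32, 8, 2, 64
tokens = torch.randn(10, d_model)  # 10 example tokens
routers = [torch.nn.Linear(d_model, n_experts) for _ in range(n_layers)]

assignments_per_token = 0
for router in routers:
    probs = F.softmax(router(tokens), dim=-1)             # (10, 8) routing probabilities
    weights, experts = torch.topk(probs, top_k, dim=-1)   # 2 experts per token per layer
    assignments_per_token += top_k
    # ...dispatch each token to its chosen experts and mix outputs by `weights`...

print(assignments_per_token)  # 64 = 2 experts/layer * 32 layers
```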
@p_nawrot
Piotr Nawrot
8 months
So glad to see the community getting interested in tackling tokenization! For anyone interested in this direction check out (). We're training a character-level LM that learns how to tokenize the input end-to-end with the model.
@jxmnop
jack morris
8 months
if I were starting my research career today and interested in language models, I would become a world expert on tokenization & tokenizers tokenization is weird, fascinating, and poorly understood, yet ubiquitous and necessary for things like chatGPT to work
6
6
161
1
5
58
@p_nawrot
Piotr Nawrot
8 months
My team at #Nvidia is looking for a Research Engineer to work on efficient Conversational AI (LLM/ASR/TTS) models. The position is remote in Europe, or on-site in Warsaw, Poland. More details: I can provide a referral : )
2
6
54
@p_nawrot
Piotr Nawrot
8 months
For anyone interested in this direction go check out (). We're training a character-level LM which jointly learns how to dynamically segment the characters and how to do language modeling. We get a faster and better model than the Transformer-XL baseline!
@iamtrask
Andrew Trask
8 months
For anyone interested in future LLM development One of the bigger unsolved deep learning problems: learning of hierarchical structure Example: we still use tokenizers to train SOTA LLMs. We should be able to feed in bits/chars/bytes and get SOTA Related: larger context window
19
76
523
3
9
54
@p_nawrot
Piotr Nawrot
18 days
@yoavgo While working on () we discovered that we're able to retain performance on many metrics, including perplexity and many downstream tasks, at very high compression ratios. Then we evaluated on MMLU and the score was terrible. From that point on our goal changed to getting…
2
1
48
@p_nawrot
Piotr Nawrot
5 months
There is fully-e2e network+tokeniser training already ()! We add a dynamic tokeniser to Transformer-XL and jointly learn to segment characters and do generative language modelling. In late-Dec we'll be releasing a large-scale follow-up, stay tuned :)
@andrew_n_carr
Andrew Carr (e/🤸)
5 months
There's a weird reality that we mostly ignore in language modeling. It's the fact that we don't _actually_ train these models end-to-end. That's because we have the tokenizer! It's actually a really frustrating piece to tune with sometimes small changes mattering a lot and…
Tweet media one
19
18
234
4
4
44
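A toy sketch of the mechanism behind the Dynamic Token Pooling work referenced above: a small module predicts segment boundaries over character embeddings, and each segment is pooled into a single vector before the main model. This is a simplification for illustration, not the paper's implementation (which trains the boundary predictor end-to-end with a stochastic relaxation).

```python
# Learned segmentation + pooling over a character sequence (toy version).
import torch
import torch.nn as nn

d_model, seq_len = 64, 12
chars = torch.randn(seq_len, d_model)        # character embeddings
boundary_predictor = nn.Linear(d_model, 1)

# 1 = "this character ends a segment"; hard decisions for illustration only
boundaries = (torch.sigmoid(boundary_predictor(chars)).squeeze(-1) > 0.5).long()
boundaries[-1] = 1                           # always close the final segment

# Mean-pool every segment into one "token" vector
segment_id = torch.cumsum(boundaries, dim=0) - boundaries   # segment index per char
n_segments = int(segment_id.max().item()) + 1
summed = torch.zeros(n_segments, d_model).index_add_(0, segment_id, chars)
counts = torch.zeros(n_segments).index_add_(0, segment_id, torch.ones(seq_len))
tokens = summed / counts.unsqueeze(-1)       # shorter sequence fed to the main model
print(tokens.shape)                          # (n_segments, 64)
```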
@p_nawrot
Piotr Nawrot
5 months
👨‍💻Random LLM engineering question👨‍💻 Is there any difference between these approaches for computing the attention weights? The former is more widely adopted (correct me if I'm wrong), but the latter is faster if min(Q_len, K_len) * D_head < Q_len * K_len which is like always?
Tweet media one
5
4
45
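The attached image is not available here, so the two approaches below are an assumed reading of the question: (a) scale the Q·Kᵀ scores by 1/√d_head after the matmul versus (b) scale Q (or K, whichever is shorter) before it. Option (b) touches a min(Q_len, K_len) × D_head tensor instead of a Q_len × K_len one, and the two are mathematically identical.

```python
# Two equivalent placements of the 1/sqrt(d_head) scaling (assumed reading of
# the question; the original image with the exact code is not available).
import math
import torch

q_len, k_len, d_head = 128, 512, 64
Q, K = torch.randn(q_len, d_head), torch.randn(k_len, d_head)

scores_a = (Q @ K.T) / math.sqrt(d_head)   # scale the (Q_len, K_len) score matrix
scores_b = (Q / math.sqrt(d_head)) @ K.T   # scale the (Q_len, D_head) queries first

print(torch.allclose(scores_a, scores_b, atol=1e-5))  # True: same result
```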
@p_nawrot
Piotr Nawrot
2 months
Great read (as always) about KV-cache compression, which will very soon become necessary given that we are able to reason over longer and longer contexts (Gemini's 10M). Also a big shout-out to @Francis_YAO_ for including our Dynamic Memory Compression work in the analysis.
@Francis_YAO_
Yao Fu
2 months
We are in the age of 100K+ context window, but how does the language model attend to 100K tokens exactly? In this post, we identify the six common attention patterns across layers and heads, aiming to provide a first intuition for kv cache compression.
6
66
361
0
5
41
@p_nawrot
Piotr Nawrot
9 months
I came across this blog that digs into (Transformer) LLM inference from the hardware side, and I truly believe that it's a must-read for everyone working on efficient Transformers 👏
@kipperrii
kipply
2 years
transformer inference performance is becoming increasingly important and there's not as much lore on it, so here is a lot of lore that i think fully models llm inference performance
6
65
491
0
5
39
@p_nawrot
Piotr Nawrot
9 months
Do Efficient Training Algorithms / Optimizers really save us compute when training Transformer LMs? 🧐 Check out our latest work where we put some of these to the test! PS. Thanks to this work I managed to further tune the nanoT5 baseline () 😇
@jeankaddour
Jean Kaddour
10 months
📢The costs for training (L)LMs skyrocketed 🚀 in recent years, motivating efficient training algorithms. However, when pre-training BERT and T5 models with a fixed compute budget, we find their gains vanish compared to a baseline with a fully-decayed learning rate! 1/5
Tweet media one
2
28
129
1
9
30
@p_nawrot
Piotr Nawrot
9 months
Check out the follow-up work "Efficient Transformers with Dynamic Token Pooling" which improves upon the Hourglass architecture with a learnable module that dynamically segments the input sequence end-to-end with the model:
@ChrSzegedy
Christian Szegedy
9 months
Nice work!
3
10
56
1
8
27
@p_nawrot
Piotr Nawrot
9 months
Amazing work with this single-GPU repo. They fine-tuned a 32K context 3B LLaMA model in under 48 hours on just one A100. It's crazy to observe this LLM progress! Great job @CStanKonrad @s_tworkowski
@s_tworkowski
Szymon Tworkowski
9 months
🎇Introducing LongLLaMA-Instruct 32K!🎇 Inspired by @p_nawrot #nanoT5 , we fine-tune LongLLaMA- on a *single GPU* for ~48h to improve upon OpenLLaMA: 55% on lm-eval (vs. 53%), better perf on long context and code! We open-source our optimized fine-tuning code in PyTorch/HF!🧵
Tweet media one
9
78
309
0
4
23
@p_nawrot
Piotr Nawrot
2 months
[2/n] The core idea of DMC at inference time is that the KV representation of the current token is either: - appended to the cache (as in vanilla Transformers); or - accumulated (weighted-averaged) with the last item in the cache. At training time DMC operates in a second mode where…
1
3
22
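A minimal sketch of the inference-time update described in the tweet above: each new key/value is either appended to the cache or merged into the last cache entry by a weighted average. The gate and averaging weights are illustrative placeholders, not DMC's learned parameterisation.

```python
# Append-or-accumulate KV-cache update (toy version of the DMC idea).
import torch

def dmc_update(k_cache, v_cache, k_new, v_new, append: bool, alpha: float):
    """k_cache/v_cache: (cache_len, d_head); k_new/v_new: (d_head,)."""
    if append or k_cache.shape[0] == 0:
        k_cache = torch.cat([k_cache, k_new[None]], dim=0)
        v_cache = torch.cat([v_cache, v_new[None]], dim=0)
    else:
        # merge into the last slot instead of growing the cache
        k_cache[-1] = alpha * k_cache[-1] + (1 - alpha) * k_new
        v_cache[-1] = alpha * v_cache[-1] + (1 - alpha) * v_new
    return k_cache, v_cache

d_head = 64
k_cache, v_cache = torch.empty(0, d_head), torch.empty(0, d_head)
for _ in range(8):
    k_new, v_new = torch.randn(d_head), torch.randn(d_head)
    append = bool(torch.rand(()) > 0.5)   # in DMC this is a learned, per-head decision
    k_cache, v_cache = dmc_update(k_cache, v_cache, k_new, v_new, append, alpha=0.5)
print(k_cache.shape[0], "cache slots after 8 tokens")
```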
@p_nawrot
Piotr Nawrot
10 months
Excited about new Transformer Language Model variants that offer joint segmentation and language modeling? Join us tomorrow for our presentation on "Efficient Transformers with Dynamic Token Pooling" at 11 am poster session at #ACL2023 . Can't wait to discuss it with you there
1
5
21
@p_nawrot
Piotr Nawrot
5 months
Everyone's invited to stop by the poster session of the NLP-OSS Workshop at #EMNLP, where you can see this piece-of-art poster for yourself in person. This is the last post about nanoT5 from me; if you haven't seen it, check out the repo. Thanks for all the kind feedback!
Tweet media one
0
1
19
@p_nawrot
Piotr Nawrot
5 months
I am coming to Singapore 🇸🇬 for #EMNLP2023 Please drop me a message if you would like to connect or discuss any of the following: - Trainable tokenisers - Efficient Transformers - Any kind of adaptive computation - Long context modelling - LLM Scaling Can't wait to see ya :)!
0
1
19
@p_nawrot
Piotr Nawrot
10 months
Happy to share a video presentation of our work “Efficient Transformers with Dynamic Token Pooling", which has been accepted to #ACL2023 . We increase the efficiency *and* performance of Transformer LMs by jointly segmenting and modelling language. 📽️
Tweet media one
0
1
17
@p_nawrot
Piotr Nawrot
2 months
[5/n] Finally, as DMC makes independent decisions for each head / layer, it opens a window into the internal mechanisms of the LLM. We find specific regions of layers that compress the most (so most of the original information is redundant), such as between the middle and the…
Tweet media one
2
1
18
@p_nawrot
Piotr Nawrot
8 months
More evidence that including code in the pre-training mixture is essential!
@s_tworkowski
Szymon Tworkowski
8 months
✨Announcing LongLLaMA-Code 7B!✨ Have you wondered how GPT3.5 obtained its capability? Are base models of code better reasoners? 🤔 We continue pre-training CodeLLaMA on text & code to improve reasoning 🧠 Bonus: 3x faster inference @ 16K context, using Focused Transformer 🎯
Tweet media one
5
52
312
0
0
16
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP In nanoT5, we expose (for research purposes) and optimise everything in the training pipeline of T5 except for the model implementation. Among others, we use: - C4 Dataset streaming - PyTorch 2.0 compile - TF32 operations - AdamW with RMS scaling
0
0
16
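For reference, a sketch of how the listed knobs are typically switched on in PyTorch 2.x and HF datasets. It is not a copy of the nanoT5 code; the model and hyperparameters here are stand-ins, and the "AdamW with RMS scaling" variant is only hinted at in a comment.

```python
# Typical way to enable the pipeline pieces listed above (illustrative only).
import torch
from datasets import load_dataset

# TF32 matmuls on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Stream the English C4 split instead of downloading it up front
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

model = torch.nn.Linear(512, 512)     # stand-in for the T5 model
model = torch.compile(model)          # PyTorch 2.0 graph compilation

# Plain AdamW with illustrative hyperparameters; nanoT5 additionally scales
# updates by parameter RMS (Adafactor-style), which is omitted in this sketch.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
```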
@p_nawrot
Piotr Nawrot
7 months
I wanted to test this idea some time ago. We know that training on code improves LLMs' CoT and reasoning abilities. Game trajectories are a source of long sequences which require a lot of reasoning and context comprehension to model well. I'm curious to see the first results!
@laion_ai
LAION
7 months
We Release... 608 B chess moves, 236 B Rubik's Cube moves, 39 B A* moves in ASCII Mazes ... to improve planning abilities of LLMs:
20
94
577
0
1
15
@p_nawrot
Piotr Nawrot
3 months
Check tokenizers in your LLMs! Latest findings from @__gautier__ et al: 1. You can swap the tokenizer in your pre-trained base model with little impact on downstream tasks (via fine-tuning) 2. Vocabulary size has little impact on downstream tasks. I wonder if we would reach the same…
@__gautier__
Gautier Dagan
3 months
PSA: Check your tokenizers! We find most code LLMs fine-tuned from a pre-trained NL model to be suboptimal for code. Preprint: This research was done during my internship @AIatMeta with @b_roziere and @syhw 1/8
Tweet media one
3
37
179
0
1
13
@p_nawrot
Piotr Nawrot
2 months
[3/n] 2x and 4x compression of the KV cache preserves (or even increases!) the performance of the original LLM (such as Llama 7B / 13B / 70B) in factuality, commonsense question answering, and coding. Not only is DMC far superior to GQA, but it can also be compounded with it:…
Tweet media one
2
2
12
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP Despite the continuously increasing size of pretrained Transformers, the research community still needs easy-to-reproduce and up-to-date baselines to test new hypotheses fast and at a small scale. To the best of our knowledge, there's no repository that reproduces T5 in PyTorch.
0
0
12
@p_nawrot
Piotr Nawrot
26 days
I'm trying to tackle an (impossible?) task of keeping up with the long-context LLM evaluation field and below is a list of recent papers I've found that introduce some new long-context evaluation schema / dataset. Please give me a hand to keep this list up-to-date, at least for…
Tweet media one
4
4
13
@p_nawrot
Piotr Nawrot
8 months
Whoah, thanks for this recognition : )
@bhutanisanyam1
Sanyam Bhutani
8 months
nanoT5: T5 model pre-training for GPU Poor! 🙏 @p_nawrot has kindly open sourced an implementation of T5-1.1 making pre-training of the model approachable on a single GPU It uses @PyTorch 2.0 and the code is very readable:
Tweet media one
2
34
174
0
1
11
@p_nawrot
Piotr Nawrot
20 days
It’s common practice to quantise LLMs to {4, 8}-bit to increase throughput/latency at a small cost to model accuracy. I was thinking about how quantisation behaves in long-context scenarios (>100k) where there are a lot of tokens to process but e.g. so few values to encode your…
4
1
12
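A rough illustration of the memory side of the question: per-tensor absmax int8 quantisation halves an fp16 KV tensor's footprint (int4 would quarter it). The shapes and the quantisation scheme are illustrative assumptions, not a statement about how any particular LLM quantises its cache.

```python
# Symmetric per-tensor int8 quantisation of a (toy) KV tensor.
import torch

kv = torch.randn(100_000, 128, dtype=torch.float16)   # e.g. 100k cached key vectors
scale = kv.abs().max().float() / 127.0
kv_int8 = torch.clamp((kv.float() / scale).round(), -127, 127).to(torch.int8)
kv_restored = kv_int8.float() * scale

fp16_mib = kv.numel() * 2 / 2**20
int8_mib = kv_int8.numel() * 1 / 2**20
print(f"{fp16_mib:.1f} MiB (fp16) -> {int8_mib:.1f} MiB (int8)")
print("max abs reconstruction error:", (kv.float() - kv_restored).abs().max().item())
```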
@p_nawrot
Piotr Nawrot
9 months
I wholeheartedly recommend Edoardo as a supervisor!
@PontiEdoardo
Edoardo Ponti
9 months
We have re-opened 2 PhD studentships for *2023/24* at @EdinburghNLP (1 home, 1 international), please send me a message by tomorrow if you are interested in this opportunity!
4
22
50
0
0
10
@p_nawrot
Piotr Nawrot
2 months
[4/n] In practice this translates into reduced latency and boosted throughput: now we can fit much larger batches (x-axis) and/or longer examples in memory! Thanks to an efficient implementation in Triton, the throughput gains (y-axis) reach the theoretical limits…
Tweet media one
1
1
11
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP We make our codebase, configs and [pre-training, fine-tuning] logs publicly available to enhance the accessibility of NLP research. We are keen to hear your suggestions to improve the codebase further. Thanks to @PontiEdoardo for his early feedback!
0
0
11
@p_nawrot
Piotr Nawrot
3 months
A new bible for everyone interested in MoE models! Amazing job @XueFz @Francis_YAO_ @NiJinjie
@XueFz
Fuzhao Xue
3 months
(1/5)🚀 Our OpenMoE Paper is out! 📄 Including: 🔍ALL Checkpoints 📊 In-depth MoE routing analysis 🤯Learning from mistakes & solutions Three important findings: (1) Context-Independent Specialization; (2) Early Routing Learning; (3) Drop-towards-the-End. Paper Link:…
Tweet media one
5
107
519
0
2
11
@p_nawrot
Piotr Nawrot
8 months
@Francis_YAO_ It’s very very true. Right now (at the beginning of the PhD) I feel I need some publications to get a minimum of recognition, but then, after a certain number of conference papers / citations, you should definitely prioritize fun over bigger numbers
0
0
9
@p_nawrot
Piotr Nawrot
7 months
Wow, this is huge! Flash Attention is now parallelised over the KV-axis!
@tri_dao
Tri Dao
7 months
Announcing Flash-Decoding, to make long-context LLM inference up to 8x faster! Great collab with @d_haziza , @fvsmassa and Grigory Sizov. Main idea: load the KV cache in parallel as fast as possible, then separately rescale to combine the results. 1/7
Tweet media one
9
151
745
0
0
10
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP To evaluate our model, we use the popular meta-dataset called Super Natural-Instructions (SNI), which aggregates datasets for many tasks. We achieve ~40 RougeL on the SNI test set, compared to ~42 RougeL of the original model available on HuggingFace Hub.
Tweet media one
1
0
9
@p_nawrot
Piotr Nawrot
3 months
@jxmnop I'd really like to agree with you here and live in a world where (almost) all that matters is data quality, but it's not true. For example, idk if you remember but I told you about this effort of mine towards reproducing T5 pre-training in PyTorch. Me, and some other attempts…
1
0
9
@p_nawrot
Piotr Nawrot
10 months
We share the configs, checkpoints, training logs, as well as our negative attempts towards improving pre-training efficiency. Advanced optimizers like Lion and Sophia, ALiBi positional embeddings, and FP16 mixed-precision training didn't yield the expected benefits.
2
0
9
@p_nawrot
Piotr Nawrot
10 months
Key upgrade in nanoT5 v2: we leverage BF16 precision and a simplified T5 model implementation based on Hugging Face's design. The new implementation is easy to read and compatible with HF checkpoints. Pre-training is now 2x faster than our previous version. 🚀
Tweet media one
1
0
8
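A minimal sketch of a BF16 mixed-precision step in PyTorch; unlike FP16, BF16 autocast needs no loss scaling, which is part of what makes it convenient. The model, data, and optimizer here are stand-ins, not nanoT5's actual training loop.

```python
# One BF16 autocast step (use device_type="cuda" on a GPU).
import torch

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 512), torch.randn(8, 512)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()        # fp32 parameters keep fp32 gradients; no GradScaler needed
optimizer.step()
optimizer.zero_grad()
```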
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP We start from a randomly initialised T5-base-v1.1 (248M parameters) implemented in HuggingFace. Next, we pre-train it on the English subset of the C4 dataset. Through several ablations, we choose the best LR scheduler / optimizer / batch size for our hardware.
Tweet media one
0
0
8
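A minimal sketch of "start from a randomly initialised T5-base-v1.1": build the model from the published Hugging Face config instead of loading pretrained weights. The surrounding nanoT5 pipeline (data, optimizer, schedule) is not shown.

```python
# Randomly initialised T5-v1.1-base from its published config (no pretrained weights).
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_pretrained("google/t5-v1_1-base")
model = T5ForConditionalGeneration(config)   # fresh random initialisation
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```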
@p_nawrot
Piotr Nawrot
5 months
Let the games begin! Looking forward to seeing multi-modal rise in 2024.
@xiangyue96
Xiang Yue
5 months
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. 🧐 Highlights of the MMMU benchmark: > 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks >…
Tweet media one
Tweet media two
Tweet media three
Tweet media four
18
187
734
0
2
8
@p_nawrot
Piotr Nawrot
5 months
Don't miss the hottest NeurIPS gem on long-context LLMs!
@s_tworkowski
Szymon Tworkowski
5 months
Honored to win Poland's best CS master thesis prize for my work on long context LLM w/ @PiotrRMilos 🎉 Can't make it to #NeurIPS2023 😭, but @CStanKonrad will present LongLLaMA paper tmr! Thu 10:45, Poster #326 , Session 5 Interested in extending context to 256K? Come and say hi!
Tweet media one
3
32
92
0
1
7
@p_nawrot
Piotr Nawrot
5 months
Is it because of low-precision (fp8 / bf16) operations and arithmetic underflow, so that we get better precision if we multiply/add larger numbers and normalise them at the end? I evaluated both approaches with the most recent flash-attention and they're equal (± eps).
0
0
7
@p_nawrot
Piotr Nawrot
3 months
New PEFT method based on sparse fine-tuning that allows you to push the limits of what you can fine-tune on your local GPU. (You have a once-in-a-lifetime opportunity to be the first person to post this hot news to your corporate papers channel on Slack so that you can reorganise…
@PontiEdoardo
Edoardo Ponti
3 months
We scaled sparse fine-tuning (SFT) to LLMs (such as Llama 2) by making it both parameter- and memory-efficient! (q)SFT instruction tuning performance is often better than (q)LoRA with comparable speed and memory load. Paper: Code:…
2
71
254
0
0
6
@p_nawrot
Piotr Nawrot
10 months
@karpathy I've just come across this Tweet, so sorry for the late reply, but you can check this work () where we propose an LM variant which starts from characters and learns to segment the sequence dynamically (into variable-length groups), end-to-end, as it goes through the model.
Tweet media one
0
0
6
@p_nawrot
Piotr Nawrot
7 months
@XueFz Just wanted to appreciate the quality of papers that you are sharing, keep up this bar! :)
1
0
5
@p_nawrot
Piotr Nawrot
2 months
@tancool_ @jxmnop I think that the Dynamic Token Pooling work we’ve authored is a “real example” of a tokenisation-free Transformer that works as an autoregressive language model - . In this work we predict the segmentation of a character-level sequence :) Let me know if you have…
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
0
0
4
@p_nawrot
Piotr Nawrot
10 months
We test different pre-training durations: 4, 8, 12, 16, 20, and 24 hours. Result? A sweet spot at 16 hours! It has comparable performance to the original model trained on 150x more data! Time & Compute-efficient, and no compromise on quality.
Tweet media one
1
0
4
@p_nawrot
Piotr Nawrot
5 months
@andrew_n_carr Take a look at () where we add a trainable "input-conditioned tokeniser" to Transformer-XL and get a generative model which jointly learns to segment characters and do language modelling. Feedback is welcome, and soon we'll release a large-scale follow-up :)
1
0
5
@p_nawrot
Piotr Nawrot
5 months
PS. I know that it's a point-wise operation, but we're at the stage where we optimise everything during LLM training and it's the largest tensor in the graph :)
0
0
5
@p_nawrot
Piotr Nawrot
18 days
@arthurmensch Why is there no comparison to Command R+ on multilingual performance? I believe that Llama is much weaker than Cohere's model according to multiple sources.
0
0
5
@p_nawrot
Piotr Nawrot
9 days
I came up with both of these ideas more than half a year ago and I haven't had time to act on them since. I'm occupied with other projects, so I think that sharing them is the right choice, as someone could get inspired by them and decide to explore them further. I would be…
2
0
4
@p_nawrot
Piotr Nawrot
8 months
@iamtrask Check this out: . We're training a character-level LM which jointly learns how to segment the characters and how to do language modeling. We get a faster and better model than the Transformer-XL baseline in terms of perplexity.
0
0
4
@p_nawrot
Piotr Nawrot
8 months
lol
@agihippo
yi 🦛
8 months
the winning comment i got from an ACL review for a scaling paper was "what has FLOPS got to do with NLP". optimising for paper acceptances is like RLHF with a shitty reward model. just like how one doesn't pretrain on garbage data, one should not read conf reviews.
4
0
40
1
0
4
@p_nawrot
Piotr Nawrot
5 months
@LodestoneE621 Great, problem solved. Thanks! :)
0
0
4
@p_nawrot
Piotr Nawrot
10 months
@MSFTResearch Consider using nanoT5 () for encoder-decoder models. It provides you with an optimized training pipeline and a simple model implementation!
2
0
3
@p_nawrot
Piotr Nawrot
5 months
@michael_nielsen Are you seriously promoting this level of journalism? Lol
0
0
2
@p_nawrot
Piotr Nawrot
5 months
Also, in terms of reproducibility, we open-sourced our code and there have already been some successful attempts to reproduce our results, based on work that cites us! :)
0
0
3
@p_nawrot
Piotr Nawrot
10 months
@kohjingyu Hey! Would you like to catch up for a chat about grounding and efficiency tomorrow?
0
0
3
@p_nawrot
Piotr Nawrot
5 months
We train on characters but bytes are also possible. Increased input length is not a problem because, similarly to Google's hierarchical Hourglass, we compress the input to obtain BPE-like compression. Read more in the original post:
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
0
0
3
@p_nawrot
Piotr Nawrot
5 months
@jxmnop Same goes for all the tutorials and workshops. It was the case during ACL23 and it was really bad
0
0
3
@p_nawrot
Piotr Nawrot
8 months
@egrefen May I work remotely from somewhere other than London in the UK?
2
0
3
@p_nawrot
Piotr Nawrot
8 months
@fouriergalois Yes, soon :)
0
0
2
@p_nawrot
Piotr Nawrot
5 months
@amfoes @ggerganov @mzh1024 Thanks for tagging :) Earlier this year we released a tokeniser that can be trained via backprop, end-to-end with the Transformer decoder network ()! In late December or very early next year we'll be releasing a large-scale follow-up, so stay tuned :)
0
0
2
@p_nawrot
Piotr Nawrot
2 months
@JohnHenryvGPT however… we have been thinking with my supervisor about overcoming this limitation and we have a couple of ideas - we are actively working on it, so make sure to follow me as updates are coming soon :)
0
0
2
@p_nawrot
Piotr Nawrot
2 months
0
0
2
@p_nawrot
Piotr Nawrot
2 months
@fouriergalois hahahahahahahahaha
0
0
2
@p_nawrot
Piotr Nawrot
6 months
@OfirPress Does it also mean support for other additive biases?
0
0
2
@p_nawrot
Piotr Nawrot
1 year
@LiuZixi9 @jonasgeiping @karpathy @EdinburghNLP Not at all, CC (the pre-training dataset) is a random crawl from the web. I haven't tried other datasets. Plugging in extra datasets takes a lot of time, and we've agreed that SNI is a good choice for now as it's popular, quite large, and diverse.
0
0
2
@p_nawrot
Piotr Nawrot
1 year
@omarsar0 Haha, it's great to see your tweet. I created nanoT5 today for the purpose of research under a limited budget. The link is here: . I would be very grateful for any retweets as I'm trying to advertise this work!
@p_nawrot
Piotr Nawrot
1 year
Introducing *nanoT5* Inspired by @jonasgeiping 's Cramming and @karpathy 's nanoGPT, we fill the gap of a repository for pre-training T5-style "LLMs" under a limited budget (1xA100 GPU, ~20 hours) in PyTorch 🧑‍💻 @EdinburghNLP
Tweet media one
8
81
460
0
0
2
@p_nawrot
Piotr Nawrot
18 days
@dchaplot Why is there no comparison to Command R+ on multilingual performance?
0
0
2
@p_nawrot
Piotr Nawrot
5 months
@vqctran Hey, are you still at the venue? I would love to catch up
0
0
2
@p_nawrot
Piotr Nawrot
2 months
@jxmnop it is a pity that he missed mine :((
0
0
2
@p_nawrot
Piotr Nawrot
2 months
@karpathy I think that the Dynamic Token Pooling work we’ve authored is a “real example” of a tokenisation-free network - . We add a dynamic tokeniser to Transformer-XL and jointly learn to segment characters and do generative language modelling. Right now we are working on…
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
0
0
2
@p_nawrot
Piotr Nawrot
5 months
@fouriergalois A true multi-modal follow-up is my goal for 2k24, but at the same time I can't express how excited I am for what's coming up soon, because it is a very important milestone. Can't say more now haha, but I'm glad that there are people waiting :)
0
0
2
@p_nawrot
Piotr Nawrot
10 months
@Yampeleg @karpathy The model implementation is unchanged. We conduct our experiments on the ~250M model and we observe that you don't need as much data as the baseline to achieve top results on the SNI benchmark at this scale. So it's either the amount of data needed for this scale or the benchmark :)
0
0
2
@p_nawrot
Piotr Nawrot
1 year
@karpathy Can it write a Torch model class template or, for example, a feed-forward layer? Did you use it for nanoGPT? If so, I would love to see a video walkthrough of Copilot :)
0
0
2
@p_nawrot
Piotr Nawrot
8 months
@dezhou Fixed tokenizers such as BPE have many drawbacks, which you can read about in . A few points: you cannot backprop through BPE once the vocab is fixed, you cannot fine-tune it to work well on new domains, you cannot merge different models; the list goes on.
1
0
2
@p_nawrot
Piotr Nawrot
1 year
@peterjansen_ai @jonasgeiping @karpathy @EdinburghNLP You're welcome :), thanks a lot for the retweet!
0
0
2
@p_nawrot
Piotr Nawrot
10 months
@sharan0909 Hey, would you like to chat tomorrow at ACL? :)
1
0
2
@p_nawrot
Piotr Nawrot
5 months
@sytelus @andrew_n_carr This idea was exploited by Google's Hourglass () a few years ago, and the Dynamic Pooling work I linked is a follow-up which lets you condition the segmentation on the underlying input and make optimal variable-length groupings :)
0
0
2
@p_nawrot
Piotr Nawrot
8 months
@0xAshith @bhutanisanyam1 @PyTorch I spent quite some time tuning it, so it definitely has purposes other than the simply educational :) The main purpose of this repo was for it to be used by researchers who need a strong baseline model that is at the same time very accessible and easily modifiable.
1
0
2
@p_nawrot
Piotr Nawrot
2 months
@s_scardapane @AdrianLancucki @PontiEdoardo Thanks for re-tweeting! Below I also link the original thread:
@p_nawrot
Piotr Nawrot
2 months
The memory in Transformers grows linearly with the sequence length at inference time. In SSMs it is constant, but often at the expense of performance. We introduce Dynamic Memory Compression (DMC) where we retrofit LLMs to compress their KV cache while preserving performance…
Tweet media one
7
70
395
0
0
2
@p_nawrot
Piotr Nawrot
10 months
Original thread:
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
0
0
1