Piotr Nawrot Profile
Piotr Nawrot

@p_nawrot

3,045 Followers
225 Following
20 Media
260 Statuses

PhD student in #NLProc @Edin_CDT_NLP | Previously intern @Nvidia & @MetaAI

Warsaw
Joined July 2014
Pinned Tweet
@p_nawrot
Piotr Nawrot
2 months
The memory in Transformers grows linearly with the sequence length at inference time. In SSMs it is constant, but often at the expense of performance. We introduce Dynamic Memory Compression (DMC) where we retrofit LLMs to compress their KV cache while preserving performance…
Tweet media one
7
70
395
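A back-of-the-envelope sketch of the linear growth mentioned in the tweet above. The layer/head/dimension numbers match the published Llama-2 7B configuration; the batch size and sequence lengths are illustrative assumptions, not figures from the DMC paper.

```python
# KV-cache size for a Llama-2-7B-like model in fp16.
# Model dimensions are the published ones; batch size and sequence lengths
# are illustrative assumptions.
n_layers, n_kv_heads, d_head = 32, 32, 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(batch, seq_len):
    # 2x accounts for storing both keys and values
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_value

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(batch=1, seq_len=seq_len) / 2**30
    print(f"seq_len={seq_len:>7}: ~{gib:.0f} GiB of KV cache")
# The cache grows linearly with seq_len; compressing it n-fold (the goal of DMC)
# divides these numbers by n.
```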
@p_nawrot
Piotr Nawrot
10 months
🎇Introducing *nanoT5 v2*🎇 Inspired by @karpathy 's #nanoGPT , we improve the repo for pre-training a T5 model in PyTorch. In ~16 hours on a single GPU, we achieve 40.7 RougeL on the SNI benchmark, compared to 40.9 RougeL of the original model pre-trained on 150x more data!
Tweet media one
7
83
476
@p_nawrot
Piotr Nawrot
1 year
Introducing *nanoT5* Inspired by @jonasgeiping 's Cramming and @karpathy 's nanoGPT, we fill the gap of a repository for pre-training T5-style "LLMs" under a limited budget (1xA100 GPU, ~20 hours) in PyTorch 🧑‍💻 @EdinburghNLP
Tweet media one
8
81
460
@p_nawrot
Piotr Nawrot
1 year
Great news! “Efficient Transformers with Dynamic Token Pooling” has been accepted to #ACL23 ! We increase the efficiency *and* performance of Transformer LMs by jointly segmenting and modelling language. @PontiEdoardo @AdrianLancucki @JChorowski 📜
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
1
26
146
@p_nawrot
Piotr Nawrot
2 months
I'll jump on the hype train and quote Andrej too, since I have been tweeting about tokenisers for quite some time now. I believe that the Dynamic Token Pooling Transformers () we've authored are a good example of a tokenisation-free network. Specifically, we…
@karpathy
Andrej Karpathy
3 months
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer" Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and…
Tweet media one
383
2K
14K
5
17
137
@p_nawrot
Piotr Nawrot
7 months
nanoT5 got accepted to the NLP Open-Source Software Workshop at #EMNLP2023 🎇 You can access the report about the repo here: More work on efficient methods for LLMs coming soon! 👀 See you in Singapore!
@p_nawrot
Piotr Nawrot
10 months
🎇Introducing *nanoT5 v2*🎇 Inspired by @karpathy 's #nanoGPT , we improve the repo for pre-training a T5 model in PyTorch. In ~16 hours on a single GPU, we achieve 40.7 RougeL on the SNI benchmark, compared to 40.9 RougeL of the original model pre-trained on 150x more data!
Tweet media one
7
83
476
1
16
96
@p_nawrot
Piotr Nawrot
8 months
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models () accepted to @NeurIPSConf ! I'm very proud of this work : ) Big congrats to @jeankaddour , @oscar__key , @PMinervini , and Matt J. Kusner!
@jeankaddour
Jean Kaddour
10 months
📢The costs for training (L)LMs skyrocketed 🚀 in recent years, motivating efficient training algorithms. However, when pre-training BERT and T5 models with a fixed compute budget, we find their gains vanish compared to a baseline with a fully-decayed learning rate! 1/5
Tweet media one
2
28
129
5
7
79
@p_nawrot
Piotr Nawrot
9 days
Two free medium-compute Mixture-of-Experts research ideas: Prerequisite: Mixtral 8x7B has 32 layers; at each layer there are 8 experts, and each token is assigned to 2 experts at a given layer. 1) Dynamic Expert Assignment in MoE Models Every token is assigned to 2*32=64 experts in…
5
12
74
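A minimal sketch of the routing arithmetic in the idea above: Mixtral-style top-2 routing over 8 experts, repeated across 32 layers, gives 2 × 32 = 64 expert assignments per token. This is an illustrative toy router, not Mixtral's actual implementation.

```python
# Toy top-2 MoE routing: 8 experts per layer, 2 chosen per token, 32 layers.
import torch
import torch.nn.functional as F

n_layers, n_experts, top_k, d_model = 32, 8, 2, 64
tokens = torch.randn(10, d_model)  # 10 example tokens
routers = [torch.nn.Linear(d_model, n_experts) for _ in range(n_layers)]

assignments_per_token = 0
for router in routers:
    probs = F.softmax(router(tokens), dim=-1)             # (10, 8) routing probabilities
    weights, experts = torch.topk(probs, top_k, dim=-1)   # 2 experts per token per layer
    assignments_per_token += top_k
    # ...dispatch each token to its chosen experts and mix outputs by `weights`...

print(assignments_per_token)  # 64 = 2 experts/layer * 32 layers
```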
@p_nawrot
Piotr Nawrot
8 months
So glad to see the community getting interested in tackling tokenization! For anyone interested in this direction check out (). We're training a character-level LM that learns how to tokenize the input end-to-end with the model.
@jxmnop
jack morris
8 months
if I were starting my research career today and interested in language models, I would become a world expert on tokenization & tokenizers tokenization is weird, fascinating, and poorly understood, yet ubiquitous and necessary for things like chatGPT to work
6
6
161
1
5
58
@p_nawrot
Piotr Nawrot
8 months
My team at #Nvidia is looking for a Research Engineer to work on efficient Conversational AI (LLM/ASR/TTS) models. The position is remote in Europe, or on-site in Warsaw, Poland. More details: I can provide a referral : )
2
6
54
@p_nawrot
Piotr Nawrot
8 months
For anyone interested in this direction go check out (). We're training a character-level LM which jointly learns how to dynamically segment the characters and how to do language modeling. We get a faster and better model than the Transformer-XL baseline!
@iamtrask
Andrew Trask
8 months
For anyone interested in future LLM development One of the bigger unsolved deep learning problems: learning of hierarchical structure Example: we still use tokenizers to train SOTA LLMs. We should be able to feed in bits/chars/bytes and get SOTA Related: larger context window
19
76
523
3
9
54
@p_nawrot
Piotr Nawrot
18 days
@yoavgo While working on () we discovered that we're able to retain performance on many metrics, including perplexity and many downstream tasks, at very high compression ratios. Then we evaluated on MMLU and the score was terrible. From that point on our goal changed to getting…
2
1
48
@p_nawrot
Piotr Nawrot
5 months
There is fully-e2e network+tokeniser training already ()! We add a dynamic tokeniser to Transformer-XL and jointly learn to segment characters and do generative language modelling. In late-Dec we'll be releasing a large-scale follow-up, stay tuned :)
@andrew_n_carr
Andrew Carr (e/🤸)
5 months
There's a weird reality that we mostly ignore in language modeling. It's the fact that we don't _actually_ train these models end-to-end. That's because we have the tokenizer! It's actually a really frustrating piece to tune with sometimes small changes mattering a lot and…
Tweet media one
19
18
234
4
4
44
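A toy sketch of the mechanism behind the Dynamic Token Pooling work referenced above: a small module predicts segment boundaries over character embeddings, and each segment is pooled into a single vector before the main model. This is a simplification for illustration, not the paper's implementation (which trains the boundary predictor end-to-end with a stochastic relaxation).

```python
# Learned segmentation + pooling over a character sequence (toy version).
import torch
import torch.nn as nn

d_model, seq_len = 64, 12
chars = torch.randn(seq_len, d_model)        # character embeddings
boundary_predictor = nn.Linear(d_model, 1)

# 1 = "this character ends a segment"; hard decisions for illustration only
boundaries = (torch.sigmoid(boundary_predictor(chars)).squeeze(-1) > 0.5).long()
boundaries[-1] = 1                           # always close the final segment

# Mean-pool every segment into one "token" vector
segment_id = torch.cumsum(boundaries, dim=0) - boundaries   # segment index per char
n_segments = int(segment_id.max().item()) + 1
summed = torch.zeros(n_segments, d_model).index_add_(0, segment_id, chars)
counts = torch.zeros(n_segments).index_add_(0, segment_id, torch.ones(seq_len))
tokens = summed / counts.unsqueeze(-1)       # shorter sequence fed to the main model
print(tokens.shape)                          # (n_segments, 64)
```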
@p_nawrot
Piotr Nawrot
5 months
👨‍💻Random LLM engineering question👨‍💻 Is there any difference between these approaches for computing the attention weights? The former is more widely adopted (correct me if I'm wrong), but the latter is faster if min(Q_len, K_len) * D_head < Q_len * K_len which is like always?
Tweet media one
5
4
45
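The attached image is not available here, so the two approaches below are an assumed reading of the question: (a) scale the Q·Kᵀ scores by 1/√d_head after the matmul versus (b) scale Q (or K, whichever is shorter) before it. Option (b) touches a min(Q_len, K_len) × D_head tensor instead of a Q_len × K_len one, and the two are mathematically identical.

```python
# Two equivalent placements of the 1/sqrt(d_head) scaling (assumed reading of
# the question; the original image with the exact code is not available).
import math
import torch

q_len, k_len, d_head = 128, 512, 64
Q, K = torch.randn(q_len, d_head), torch.randn(k_len, d_head)

scores_a = (Q @ K.T) / math.sqrt(d_head)   # scale the (Q_len, K_len) score matrix
scores_b = (Q / math.sqrt(d_head)) @ K.T   # scale the (Q_len, D_head) queries first

print(torch.allclose(scores_a, scores_b, atol=1e-5))  # True: same result
```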
@p_nawrot
Piotr Nawrot
2 months
Great read (as always) about KV-cache compression, which will very soon become necessary given that we are able to reason over longer and longer contexts (Gemini's 10M). Also a big shout-out to @Francis_YAO_ for including our Dynamic Memory Compression work in the analysis.
@Francis_YAO_
Yao Fu
2 months
We are in the age of 100K+ context window, but how does the language model attend to 100K tokens exactly? In this post, we identify the six common attention patterns across layers and heads, aiming to provide a first intuition for kv cache compression.
6
66
361
0
5
41
@p_nawrot
Piotr Nawrot
9 months
I came across this blog that digs into (Transformer) LLM inference from the hardware side, and I truly believe that it's a must-read for everyone working on efficient Transformers 👏
@kipperrii
kipply
2 years
transformer inference performance is becoming increasingly important and there's not as much lore on it, so here is a lot of lore that i think fully models llm inference performance
6
65
491
0
5
39
@p_nawrot
Piotr Nawrot
9 months
Do Efficient Training Algorithms / Optimizers really save us compute when training Transformer LMs? 🧐 Check out our latest work where we put some of these to the test! PS. Thanks to this work I managed to further tune the nanoT5 baseline () 😇
@jeankaddour
Jean Kaddour
10 months
📢The costs for training (L)LMs skyrocketed 🚀 in recent years, motivating efficient training algorithms. However, when pre-training BERT and T5 models with a fixed compute budget, we find their gains vanish compared to a baseline with a fully-decayed learning rate! 1/5
Tweet media one
2
28
129
1
9
30
@p_nawrot
Piotr Nawrot
9 months
Check out the follow-up work "Efficient Transformers with Dynamic Token Pooling" which improves upon the Hourglass architecture with a learnable module that dynamically segments the input sequence end-to-end with the model:
@ChrSzegedy
Christian Szegedy
9 months
Nice work!
3
10
56
1
8
27
@p_nawrot
Piotr Nawrot
9 months
Amazing work with this single-GPU repo. They fine-tuned a 32K context 3B LLaMA model in under 48 hours on just one A100. It's crazy to observe this LLM progress! Great job @CStanKonrad @s_tworkowski
@s_tworkowski
Szymon Tworkowski
9 months
🎇Introducing LongLLaMA-Instruct 32K!🎇 Inspired by @p_nawrot #nanoT5 , we fine-tune LongLLaMA- on a *single GPU* for ~48h to improve upon OpenLLaMA: 55% on lm-eval (vs. 53%), better perf on long context and code! We open-source our optimized fine-tuning code in PyTorch/HF!🧵
Tweet media one
9
78
309
0
4
23
@p_nawrot
Piotr Nawrot
2 months
[2/n] The core idea of DMC at inference time is that the KV representation of the current token is either: - appended to the cache (as in vanilla Transformers); or - accumulated (weighted-averaged) with the last item in the cache. At training time DMC operates in a second mode where…
1
3
22
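A minimal sketch of the inference-time update described in the tweet above: each new key/value is either appended to the cache or merged into the last cache entry by a weighted average. The gate and averaging weights are illustrative placeholders, not DMC's learned parameterisation.

```python
# Append-or-accumulate KV-cache update (toy version of the DMC idea).
import torch

def dmc_update(k_cache, v_cache, k_new, v_new, append: bool, alpha: float):
    """k_cache/v_cache: (cache_len, d_head); k_new/v_new: (d_head,)."""
    if append or k_cache.shape[0] == 0:
        k_cache = torch.cat([k_cache, k_new[None]], dim=0)
        v_cache = torch.cat([v_cache, v_new[None]], dim=0)
    else:
        # merge into the last slot instead of growing the cache
        k_cache[-1] = alpha * k_cache[-1] + (1 - alpha) * k_new
        v_cache[-1] = alpha * v_cache[-1] + (1 - alpha) * v_new
    return k_cache, v_cache

d_head = 64
k_cache, v_cache = torch.empty(0, d_head), torch.empty(0, d_head)
for _ in range(8):
    k_new, v_new = torch.randn(d_head), torch.randn(d_head)
    append = bool(torch.rand(()) > 0.5)   # in DMC this is a learned, per-head decision
    k_cache, v_cache = dmc_update(k_cache, v_cache, k_new, v_new, append, alpha=0.5)
print(k_cache.shape[0], "cache slots after 8 tokens")
```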
@p_nawrot
Piotr Nawrot
10 months
Excited about new Transformer Language Model variants that offer joint segmentation and language modeling? Join us tomorrow for our presentation on "Efficient Transformers with Dynamic Token Pooling" at 11 am poster session at #ACL2023 . Can't wait to discuss it with you there
1
5
21
@p_nawrot
Piotr Nawrot
5 months
Everyone's invited to stop by the poster session of the NLP-OSS Workshop at #EMNLP, where you can see this piece-of-art poster for yourself in person. This is the last post about nanoT5 from me; if you haven't seen it, check out the repo. Thanks for all the kind feedback!
Tweet media one
0
1
19
@p_nawrot
Piotr Nawrot
5 months
I am coming to Singapore 🇸🇬 for #EMNLP2023 Please drop me a message if you would like to connect or discuss any of the following: - Trainable tokenisers - Efficient Transformers - Any kind of adaptive computation - Long context modelling - LLM Scaling Can't wait to see ya :)!
0
1
19
@p_nawrot
Piotr Nawrot
10 months
Happy to share a video presentation of our work “Efficient Transformers with Dynamic Token Pooling", which has been accepted to #ACL2023 . We increase the efficiency *and* performance of Transformer LMs by jointly segmenting and modelling language. 📽️
Tweet media one
0
1
17
@p_nawrot
Piotr Nawrot
2 months
[5/n] Finally, as DMC makes independent decisions for each head / layer, it opens a window into the internal mechanisms of the LLM. We find specific regions of layers that compress the most (so most of the original information is redundant), such as between the middle and the…
Tweet media one
2
1
18
@p_nawrot
Piotr Nawrot
8 months
More evidence that including code in the pre-training mixture is essential!
@s_tworkowski
Szymon Tworkowski
8 months
✨Announcing LongLLaMA-Code 7B!✨ Have you wondered how GPT3.5 obtained its capability? Are base models of code better reasoners? 🤔 We continue pre-training CodeLLaMA on text & code to improve reasoning 🧠 Bonus: 3x faster inference @ 16K context, using Focused Transformer 🎯
Tweet media one
5
52
312
0
0
16
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP In nanoT5, we expose (for research purposes) and optimise everything in the training pipeline of T5 except for the model implementation. Among others, we use: - C4 Dataset streaming - PyTorch 2.0 compile - TF32 operations - AdamW with RMS scaling
0
0
16
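For reference, a sketch of how the listed knobs are typically switched on in PyTorch 2.x and HF datasets. It is not a copy of the nanoT5 code; the model and hyperparameters here are stand-ins, and the "AdamW with RMS scaling" variant is only hinted at in a comment.

```python
# Typical way to enable the pipeline pieces listed above (illustrative only).
import torch
from datasets import load_dataset

# TF32 matmuls on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Stream the English C4 split instead of downloading it up front
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

model = torch.nn.Linear(512, 512)     # stand-in for the T5 model
model = torch.compile(model)          # PyTorch 2.0 graph compilation

# Plain AdamW with illustrative hyperparameters; nanoT5 additionally scales
# updates by parameter RMS (Adafactor-style), which is omitted in this sketch.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
```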
@p_nawrot
Piotr Nawrot
7 months
I wanted to test this idea some time ago. We know that training on code improves LLMs' CoT and reasoning abilities. Game trajectories are a source of long sequences which require a lot of reasoning and context comprehension to model well. I'm curious to see the first results!
@laion_ai
LAION
7 months
We Release... 608 B chess moves, 236 B Rubik's Cube moves, 39 B A* moves in ASCII Mazes ... to improve planning abilities of LLMs:
20
94
577
0
1
15
@p_nawrot
Piotr Nawrot
3 months
Check tokenizers in your LLMs! Latest findings from @__gautier__ et al: 1. You can swap the tokenizer in your pre-trained base model with little impact on downstream tasks (via fine-tuning) 2. Vocabulary size has little impact on downstream tasks. I wonder if we would reach the same…
@__gautier__
Gautier Dagan
3 months
PSA: Check your tokenizers! We find most code LLMs fine-tuned from a pre-trained NL model to be suboptimal for code. Preprint: This research was done during my internship @AIatMeta with @b_roziere and @syhw 1/8
Tweet media one
3
37
179
0
1
13
@p_nawrot
Piotr Nawrot
2 months
[3/n] 2x and 4x compression of the KV cache preserves (or even increases!) the performance of the original LLM (such as Llama 7B / 13B / 70B) in factuality, commonsense question answering, and coding. Not only is DMC far superior to GQA, but it can also be compounded with it:…
Tweet media one
2
2
12
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP Despite the continuously increasing size of pretrained Transformers, the research community still needs easy-to-reproduce and up-to-date baselines to test new hypotheses fast and at a small scale. To the best of our knowledge, there's no repository that reproduces T5 in PyTorch.
0
0
12
@p_nawrot
Piotr Nawrot
26 days
I'm trying to tackle an (impossible?) task of keeping up with the long-context LLM evaluation field and below is a list of recent papers I've found that introduce some new long-context evaluation schema / dataset. Please give me a hand to keep this list up-to-date, at least for…
Tweet media one
4
4
13
@p_nawrot
Piotr Nawrot
8 months
Whoah, thanks for this recognition : )
@bhutanisanyam1
Sanyam Bhutani
8 months
nanoT5: T5 model pre-training for GPU Poor! 🙏 @p_nawrot has kindly open sourced an implementation of T5-1.1 making pre-training of the model approachable on a single GPU It uses @PyTorch 2.0 and the code is very readable:
Tweet media one
2
34
174
0
1
11
@p_nawrot
Piotr Nawrot
20 days
It’s common practice to quantise LLMs to {4, 8}-bit to increase throughput/latency at a small cost to model accuracy. I was thinking about how quantisation behaves in long-context scenarios (>100k) where there are a lot of tokens to process but e.g. so few values to encode your…
4
1
12
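A rough illustration of the memory side of the question: per-tensor absmax int8 quantisation halves an fp16 KV tensor's footprint (int4 would quarter it). The shapes and the quantisation scheme are illustrative assumptions, not a statement about how any particular LLM quantises its cache.

```python
# Symmetric per-tensor int8 quantisation of a (toy) KV tensor.
import torch

kv = torch.randn(100_000, 128, dtype=torch.float16)   # e.g. 100k cached key vectors
scale = kv.abs().max().float() / 127.0
kv_int8 = torch.clamp((kv.float() / scale).round(), -127, 127).to(torch.int8)
kv_restored = kv_int8.float() * scale

fp16_mib = kv.numel() * 2 / 2**20
int8_mib = kv_int8.numel() * 1 / 2**20
print(f"{fp16_mib:.1f} MiB (fp16) -> {int8_mib:.1f} MiB (int8)")
print("max abs reconstruction error:", (kv.float() - kv_restored).abs().max().item())
```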
@p_nawrot
Piotr Nawrot
9 months
I wholeheartedly recommend Edoardo as a supervisor!
@PontiEdoardo
Edoardo Ponti
9 months
We have re-opened 2 PhD studentships for *2023/24* at @EdinburghNLP (1 home, 1 international), please send me a message by tomorrow if you are interested in this opportunity!
4
22
50
0
0
10
@p_nawrot
Piotr Nawrot
2 months
[4/n] In practice this translates into reduced latency and boosted throughput: now we can fit much larger batches (x-axis) and/or longer examples in memory! Thanks to an efficient implementation in Triton, the throughput gains (y-axis) reach the theoretical limits…
Tweet media one
1
1
11
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP We make our codebase, configs and [pre-training, fine-tuning] logs publicly available to enhance the accessibility of NLP research. We are keen to hear your suggestions to improve the codebase further. Thanks to @PontiEdoardo for his early feedback!
0
0
11
@p_nawrot
Piotr Nawrot
3 months
A new bible for everyone interested in MoE models! Amazing job @XueFz @Francis_YAO_ @NiJinjie
@XueFz
Fuzhao Xue
3 months
(1/5)🚀 Our OpenMoE Paper is out! 📄 Including: 🔍ALL Checkpoints 📊 In-depth MoE routing analysis 🤯Learning from mistakes & solutions Three important findings: (1) Context-Independent Specialization; (2) Early Routing Learning; (3) Drop-towards-the-End. Paper Link:…
Tweet media one
5
107
519
0
2
11
@p_nawrot
Piotr Nawrot
8 months
@Francis_YAO_ It’s very very true. Right now (at the beginning of the PhD) I feel I need some publications to get a minimum of recognition, but then, after a certain number of conference papers / citations, you should definitely prioritize fun over bigger numbers
0
0
9
@p_nawrot
Piotr Nawrot
7 months
Wow, this is huge! Flash Attention is now parallelised over the KV-axis!
@tri_dao
Tri Dao
7 months
Announcing Flash-Decoding, to make long-context LLM inference up to 8x faster! Great collab with @d_haziza , @fvsmassa and Grigory Sizov. Main idea: load the KV cache in parallel as fast as possible, then separately rescale to combine the results. 1/7
Tweet media one
9
151
745
0
0
10
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP To evaluate our model, we use the popular meta-dataset called Super Natural-Instructions (SNI), which aggregates datasets for many tasks. We achieve ~40 RougeL on the SNI test set, compared to ~42 RougeL of the original model available on HuggingFace Hub.
Tweet media one
1
0
9
@p_nawrot
Piotr Nawrot
3 months
@jxmnop I'd really like to agree with you here and live in a world where (almost) all that matters is data quality, but it's not true. For example, idk if you remember but I told you about this effort of mine towards reproducing T5 pre-training in PyTorch. Me, and some other attempts…
1
0
9
@p_nawrot
Piotr Nawrot
10 months
We share the configs, checkpoints, training logs, as well as our negative attempts towards improving pre-training efficiency. Advanced optimizers like Lion and Sophia, ALiBi positional embeddings, and FP16 mixed-precision training didn't yield the expected benefits.
2
0
9
@p_nawrot
Piotr Nawrot
10 months
Key upgrade in nanoT5 v2: we leverage BF16 precision and a simplified T5 model implementation based on Hugging Face's design. The new implementation is easy to read and compatible with HF checkpoints. Pre-training is now 2x faster than our previous version. 🚀
Tweet media one
1
0
8
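A minimal sketch of a BF16 mixed-precision step in PyTorch; unlike FP16, BF16 autocast needs no loss scaling, which is part of what makes it convenient. The model, data, and optimizer here are stand-ins, not nanoT5's actual training loop.

```python
# One BF16 autocast step (use device_type="cuda" on a GPU).
import torch

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 512), torch.randn(8, 512)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()        # fp32 parameters keep fp32 gradients; no GradScaler needed
optimizer.step()
optimizer.zero_grad()
```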
@p_nawrot
Piotr Nawrot
1 year
@jonasgeiping @karpathy @EdinburghNLP We start from a randomly initialised T5-base-v1.1 (248M parameters) implemented in HuggingFace. Next, we pre-train it on the English subset of the C4 dataset. Through several ablations, we choose the best LR scheduler / optimizer / batch size for our hardware.
Tweet media one
0
0
8
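A minimal sketch of "start from a randomly initialised T5-base-v1.1": build the model from the published Hugging Face config instead of loading pretrained weights. The surrounding nanoT5 pipeline (data, optimizer, schedule) is not shown.

```python
# Randomly initialised T5-v1.1-base from its published config (no pretrained weights).
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_pretrained("google/t5-v1_1-base")
model = T5ForConditionalGeneration(config)   # fresh random initialisation
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```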
@p_nawrot
Piotr Nawrot
5 months
Let the games begin! Looking forward to seeing multi-modal rise in 2024.
@xiangyue96
Xiang Yue
5 months
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. 🧐 Highlights of the MMMU benchmark: > 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks >…
Tweet media one
Tweet media two
Tweet media three
Tweet media four
18
187
734
0
2
8
@p_nawrot
Piotr Nawrot
5 months
Don't miss the hottest NeurIPS gem on long-context LLMs!
@s_tworkowski
Szymon Tworkowski
5 months
Honored to win Poland's best CS master thesis prize for my work on long context LLM w/ @PiotrRMilos 🎉 Can't make it to #NeurIPS2023 😭, but @CStanKonrad will present LongLLaMA paper tmr! Thu 10:45, Poster #326 , Session 5 Interested in extending context to 256K? Come and say hi!
Tweet media one
3
32
92
0
1
7
@p_nawrot
Piotr Nawrot
5 months
Is it because of low-precision (fp8 / bf16) operations and arithmetic underflow, so that we get better precision if we multiply/add larger numbers and normalise them at the end? I evaluated both approaches with the most recent flash-attention and they're equal (± eps).
0
0
7
@p_nawrot
Piotr Nawrot
3 months
New PEFT method based on sparse fine-tuning that allows you to push the limits of what you can fine-tune on your local GPU. (You have a once-in-a-lifetime opportunity to be the first person to post this hot news to your corporate papers channel on Slack so that you can reorganise…
@PontiEdoardo
Edoardo Ponti
3 months
We scaled sparse fine-tuning (SFT) to LLMs (such as Llama 2) by making it both parameter- and memory-efficient! (q)SFT instruction tuning performance is often better than (q)LoRA with comparable speed and memory load. Paper: Code:…
2
71
254
0
0
6
@p_nawrot
Piotr Nawrot
10 months
@karpathy I've just come across this Tweet, so sorry for the late reply, but you can check this work () where we propose an LM variant which starts from characters and learns to segment the sequence dynamically (into variable-length groups), end-to-end, as it goes through the model.
Tweet media one
0
0
6
@p_nawrot
Piotr Nawrot
7 months
@XueFz Just wanted to appreciate the quality of papers that you are sharing, keep up this bar! :)
1
0
5
@p_nawrot
Piotr Nawrot
2 months
@tancool_ @jxmnop I think that the Dynamic Token Pooling work we’ve authored is a “real example” of a tokenisation-free Transformer that works as an autoregressive language model - . In this work we predict the segmentation of a character-level sequence :) Let me know if you have…
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
0
0
4
@p_nawrot
Piotr Nawrot
10 months
We test different pre-training durations: 4, 8, 12, 16, 20, and 24 hours. Result? A sweet spot at 16 hours! It has comparable performance to the original model trained on 150x more data! Time & Compute-efficient, and no compromise on quality.
Tweet media one
1
0
4
@p_nawrot
Piotr Nawrot
5 months
@andrew_n_carr Take a look at () where we add a trainable "input-conditioned tokeniser" to Transformer-XL and get a generative model which jointly learns to segment characters and do language modelling. Feedback is welcome, and soon we'll release a large-scale follow-up :)
1
0
5
@p_nawrot
Piotr Nawrot
5 months
PS. I know that it's a point-wise operation, but we're at the stage where we optimise everything during LLM training and it's the largest tensor in the graph :)
0
0
5
@p_nawrot
Piotr Nawrot
18 days
@arthurmensch Why is there no comparison to Command R+ on multilingual performance? I believe that Llama is much weaker than Cohere's model according to multiple sources.
0
0
5
@p_nawrot
Piotr Nawrot
9 days
I came up with both of these ideas more than half a year ago and I haven't had time to act on them since. I'm occupied with other projects, so I think that sharing them is the right choice, as someone could get inspired by them and decide to explore them further. I would be…
2
0
4
@p_nawrot
Piotr Nawrot
8 months
@iamtrask Check this out: . We're training a character-level LM which jointly learns how to segment the characters and how to do language modeling. We get a faster and better model than the Transformer-XL baseline in terms of perplexity.
0
0
4
@p_nawrot
Piotr Nawrot
8 months
lol
@agihippo
yi 🦛
8 months
the winning comment i got from an ACL review for a scaling paper was "what has FLOPS got to do with NLP". optimising for paper acceptances is like RLHF with a shitty reward model. just like how one doesn't pretrain on garbage data, one should not read conf reviews.
4
0
40
1
0
4
@p_nawrot
Piotr Nawrot
5 months
@LodestoneE621 Great, problem solved. Thanks! :)
0
0
4
@p_nawrot
Piotr Nawrot
10 months
@MSFTResearch Consider using nanoT5 () for encoder-decoder models. It provides you with an optimized training pipeline and a simple model implementation!
2
0
3
@p_nawrot
Piotr Nawrot
5 months
@michael_nielsen Are you seriously promoting this level of journalism? Lol
0
0
2
@p_nawrot
Piotr Nawrot
5 months
Also, in terms of reproducibility, we open-sourced our code and there have already been some successful attempts to reproduce our results, based on work that cites us! :)
0
0
3
@p_nawrot
Piotr Nawrot
10 months
@kohjingyu Hey! Would you like to catch up for a chat about grounding and efficiency tomorrow?
0
0
3
@p_nawrot
Piotr Nawrot
5 months
We train on characters but bytes are also possible. Increased input length is not a problem because, similarly to Google's hierarchical Hourglass, we compress the input to obtain BPE-like compression. Read more in the original post:
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
0
0
3
@p_nawrot
Piotr Nawrot
5 months
@jxmnop Same goes for all the tutorials and workshops. It was the case during ACL23 and it was really bad
0
0
3
@p_nawrot
Piotr Nawrot
8 months
@egrefen May I work remotely from somewhere other than London in the UK?
2
0
3
@p_nawrot
Piotr Nawrot
8 months
@fouriergalois Yes, soon :)
0
0
2
@p_nawrot
Piotr Nawrot
5 months
@amfoes @ggerganov @mzh1024 Thanks for tagging :) Earlier this year we released a tokeniser that can be trained via backprop, end-to-end with the Transformer decoder network ()! In late December or very early next year we'll be releasing a large-scale follow-up, so stay tuned :)
0
0
2
@p_nawrot
Piotr Nawrot
2 months
@JohnHenryvGPT however… we have been thinking with my supervisor about overcoming this limitation and we have a couple of ideas - we are actively working on it, so make sure to follow me as updates are coming soon :)
0
0
2
@p_nawrot
Piotr Nawrot
2 months
0
0
2
@p_nawrot
Piotr Nawrot
2 months
@fouriergalois hahahahahahahahaha
0
0
2
@p_nawrot
Piotr Nawrot
6 months
@OfirPress Does it also mean support for other additive biases?
0
0
2
@p_nawrot
Piotr Nawrot
1 year
@LiuZixi9 @jonasgeiping @karpathy @EdinburghNLP Not at all, CC (the pre-training dataset) is a random crawl from the web. I haven't tried other datasets. Plugging in extra datasets takes a lot of time, and we've agreed that SNI is a good choice for now as it's popular, quite large, and diverse.
0
0
2
@p_nawrot
Piotr Nawrot
1 year
@omarsar0 Haha, it's great to see your tweet. I created nanoT5 today for the purpose of research under a limited budget. The link is here: . I would be very grateful for any retweets as I'm trying to advertise this work!
@p_nawrot
Piotr Nawrot
1 year
Introducing *nanoT5* Inspired by @jonasgeiping 's Cramming and @karpathy 's nanoGPT, we fill the gap of a repository for pre-training T5-style "LLMs" under a limited budget (1xA100 GPU, ~20 hours) in PyTorch 🧑‍💻 @EdinburghNLP
Tweet media one
8
81
460
0
0
2
@p_nawrot
Piotr Nawrot
18 days
@dchaplot Why is there no comparison to Command R+ on multilingual performance?
0
0
2
@p_nawrot
Piotr Nawrot
5 months
@vqctran Hey, are you still at the venue? I would love to catch up
0
0
2
@p_nawrot
Piotr Nawrot
2 months
@jxmnop it is a pity that he missed mine :((
0
0
2
@p_nawrot
Piotr Nawrot
2 months
@karpathy I think that the Dynamic Token Pooling work we’ve authored is a “real example” of a tokenisation-free network - . We add a dynamic tokeniser to Transformer-XL and jointly learn to segment characters and do generative language modelling. Right now we are working on…
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
0
0
2
@p_nawrot
Piotr Nawrot
5 months
@fouriergalois A true multi-modal follow-up is my goal for 2k24, but at the same time I can't express how excited I am for what's coming up soon, because it is a very important milestone. Can't say more now haha, but I'm glad that there are people waiting :)
0
0
2
@p_nawrot
Piotr Nawrot
10 months
@Yampeleg @karpathy The model implementation is unchanged. We conduct our experiments on the ~250M model and we observe that you don't need as much data as the baseline to achieve top results on the SNI benchmark at this scale. So it's either the amount of data needed for this scale or the benchmark :)
0
0
2
@p_nawrot
Piotr Nawrot
1 year
@karpathy Can it write a Torch model class template or, for example, a feed-forward layer? Did you use it for nanoGPT? If so, I would love to see a video walkthrough of Copilot :)
0
0
2
@p_nawrot
Piotr Nawrot
8 months
@dezhou Fixed tokenizers such as BPE have many drawbacks, which you can read about in . A few points: you cannot backprop through BPE once the vocab is fixed, you cannot fine-tune it to work well on new domains, you cannot merge different models; the list goes on.
1
0
2
@p_nawrot
Piotr Nawrot
1 year
@peterjansen_ai @jonasgeiping @karpathy @EdinburghNLP You're welcome :), thanks a lot for the retweet!
0
0
2
@p_nawrot
Piotr Nawrot
10 months
@sharan0909 Hey, would you like to chat tomorrow at ACL? :)
1
0
2
@p_nawrot
Piotr Nawrot
5 months
@sytelus @andrew_n_carr This idea was exploited by Google's Hourglass () a few years ago, and the Dynamic Pooling work I linked is a follow-up which lets you condition the segmentation on the underlying input and make optimal variable-length groupings :)
0
0
2
@p_nawrot
Piotr Nawrot
8 months
@0xAshith @bhutanisanyam1 @PyTorch I spent quite some time tuning it, so it definitely has purposes other than the simply educational :) The main purpose of this repo was for it to be used by researchers who need a strong baseline model that is at the same time very accessible and easily modifiable.
1
0
2
@p_nawrot
Piotr Nawrot
2 months
@s_scardapane @AdrianLancucki @PontiEdoardo Thanks for re-tweeting! Below I also link the original thread:
@p_nawrot
Piotr Nawrot
2 months
The memory in Transformers grows linearly with the sequence length at inference time. In SSMs it is constant, but often at the expense of performance. We introduce Dynamic Memory Compression (DMC) where we retrofit LLMs to compress their KV cache while preserving performance…
Tweet media one
7
70
395
0
0
2
@p_nawrot
Piotr Nawrot
10 months
Original thread:
@PontiEdoardo
Edoardo Ponti
1 year
Can we increase the efficiency *and* performance of auto-regressive models? We introduce dynamic-pooling Transformers, which jointly perform language modelling and token segmentation. @p_nawrot * @AdrianLancucki @JChorowski 📜 🧑‍💻
2
23
92
0
0
1