Training with QuantNoise makes it possible to strongly compress neural networks: 80.0% accuracy on ImageNet in 3.3MB, 82.5% accuracy on MNLI in 14MB.
Blog:
Paper:
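The trick, roughly: at each forward pass only a random subset of the weights is (fake-)quantized while the rest stays in full precision, so gradients keep flowing. A minimal sketch, using simple scalar int8 quantization as a stand-in for the product quantization used in the paper:

```python
import torch

def quant_noise(weight: torch.Tensor, p: float = 0.1, bits: int = 8,
                training: bool = True) -> torch.Tensor:
    """Apply quantization noise: quantize a random fraction p of the weights."""
    if not training:
        return weight
    # Symmetric scalar quantization to `bits` bits (stand-in for product quantization).
    scale = weight.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    quantized = torch.round(weight / scale) * scale
    # Only a random subset of the weights sees the quantization noise.
    mask = (torch.rand_like(weight) < p).float()
    mixed = mask * quantized + (1.0 - mask) * weight
    # Straight-through estimator: forward pass uses `mixed`, backward behaves like identity.
    return weight + (mixed - weight).detach()
```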
New paper on memory efficient open domain question answering. We show that combining dimension reduction, vector quantization and passage filtering greatly reduces the memory footprint of retrieval-based systems, without hurting accuracy too much.
Paper:
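For the curious, the recipe maps almost directly onto faiss (the index classes are real; the dimensions and code sizes below are illustrative, not the paper's settings):

```python
import numpy as np
import faiss

d_in, d_out = 768, 256           # original / reduced embedding dimension (illustrative)
n_subvectors, n_bits = 32, 8     # product quantization: 32 bytes per passage (illustrative)

passages = np.random.rand(10000, d_in).astype("float32")   # stand-in for passage embeddings

pca = faiss.PCAMatrix(d_in, d_out)                # dimension reduction
pq = faiss.IndexPQ(d_out, n_subvectors, n_bits)   # vector quantization
index = faiss.IndexPreTransform(pca, pq)

index.train(passages)
index.add(passages)
distances, ids = index.search(passages[:1], 10)   # top-10 passages for one query
```

Passage filtering (dropping passages that are unlikely to ever be retrieved) then shrinks the number of rows stored in the index itself.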
🔎 Can we train dense unsupervised retrievers that are as good as BM25? With the latest contrastive learning techniques, it seems that we are getting there! Our model, the Contriever, outperforms BM25 on NQ, and is competitive on BEIR.
Paper:
Announcing Kyutai: a non-profit AI lab dedicated to open science. Thanks to Xavier Niel (
@GroupeIliad
), Rodolphe Saadé (
@cmacgm
) and Eric Schmidt (
@SchmidtFutures
), we are starting with almost 300M€ of philanthropic support. Meet the team ⬇️
New
@ACL2019_Italy
paper: Adaptive Attention Span in Transformers, with S. Sukhbaatar (
@tesatory
), P. Bojanowski,
@armandjoulin
. We scale to large context (up to 8k) and reduce memory footprint by learning attention length for each head and layer. SOTA on text8/enwik8. 1/2
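The mechanism, in short: each head learns a span z, and attention weights beyond that span are ramped smoothly to zero. A minimal PyTorch sketch (hyper-parameters are placeholders, and the penalty on z that encourages small spans is omitted):

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    def __init__(self, max_span: int = 8192, ramp: int = 32):
        super().__init__()
        self.max_span, self.ramp = max_span, ramp
        self.z = nn.Parameter(torch.zeros(1))   # learned span (as a fraction of max_span)

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (..., span) attention weights over the last `span` positions.
        span = attn.size(-1)
        # Distance of each attended position from the current query position.
        dist = torch.arange(span - 1, -1, -1, device=attn.device, dtype=attn.dtype)
        z = self.z.clamp(0, 1) * self.max_span
        # Soft mask: 1 within the learned span, linear ramp of width `ramp`, 0 beyond.
        mask = ((z + self.ramp - dist) / self.ramp).clamp(0, 1)
        attn = attn * mask
        return attn / (attn.sum(-1, keepdim=True) + 1e-8)   # re-normalize
```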
Facebook AI researchers are sharing an all-attention layer to simplify the Transformer model and an adaptive attention span method to make it more efficient. Even with a much simpler architecture, these methods match or improve on state-of-the-art results.
New paper with Gautier Izacard (
@gizacard
), using distillation to train information retrieval systems! We show that attention scores of a model trained on the downstream task can be used as synthetic labels. This makes it possible to train retrievers without document or passage annotations.
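A hedged sketch of the objective: the reader's cross-attention mass on each retrieved passage (aggregated over layers, heads and tokens) is treated as a soft relevance label, and the retriever is trained to match that distribution. Shapes and the aggregation are assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(retriever_scores: torch.Tensor,
                                reader_attention: torch.Tensor) -> torch.Tensor:
    """retriever_scores: (batch, n_passages) question-passage similarities.
    reader_attention: (batch, n_passages) cross-attention mass per passage,
    used as a synthetic relevance label (no gradient flows into the reader)."""
    target = reader_attention / reader_attention.sum(-1, keepdim=True)
    log_pred = F.log_softmax(retriever_scores, dim=-1)
    return F.kl_div(log_pred, target.detach(), reduction="batchmean")
```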
New work w/
@gizacard
(Gautier Izacard): how much do generative models for open domain QA benefit from retrieval? A lot! Retrieving 100 passages, we get 51.4 EM on NaturalQuestions, 67.6 EM on TriviaQA. 1/3
Paper:
Very excited to introduce Atlas, a new retrieval augmented language model which is competitive with larger models on few-shot tasks such as question answering or fact checking.
Work led by
@gizacard
and
@PSH_Lewis
.
Paper:
🚨We’ve been working on better retrieval-augmented models & thrilled to present Atlas, led by
@gizacard
@EXGRV
& myself🚨
Atlas is an end2end pretrained "RAG"-like model, beats models 50x its size on fewshot QA, sets numerous SotA on knowledge-intensive NLP
New paper on unsupervised mapping of word vectors, using Procrustes in Wasserstein distance, available on arxiv: . With
@armandjoulin
and Q. Berthet. More resources to come soon!
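The idea in a toy sketch: alternate between matching the two sets of vectors (the Wasserstein / assignment step) and solving orthogonal Procrustes in closed form (the Procrustes step). The paper optimizes this stochastically on mini-batches; the exact version below only scales to tiny vocabularies:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_procrustes(X: np.ndarray, Y: np.ndarray, n_iter: int = 10) -> np.ndarray:
    """Return an orthogonal matrix Q such that X @ Q is aligned with Y."""
    Q = np.eye(X.shape[1])
    for _ in range(n_iter):
        # 1) Assignment step: match rotated source vectors to target vectors.
        cost = -X @ Q @ Y.T                      # negative inner product as matching cost
        rows, cols = linear_sum_assignment(cost)
        # 2) Procrustes step: closed-form orthogonal mapping via SVD.
        U, _, Vt = np.linalg.svd(X[rows].T @ Y[cols])
        Q = U @ Vt
    return Q
```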
✈️ I will be attending
#NeurIPS2023
: let me know if you want to chat about the future of LLMs, and how to democratize them.
🌐 We are also hiring members of technical staff and interns
@kyutai_labs
. Happy to talk about the lab and our mission.
Super excited by the release of LLaMA, a series of large language models, from 7B to 65B parameters. 🎉
By training longer, LLaMA obtains GPT3 level performance with a 13B model, which can run on a single GPU. Excited to see what the research community will do with these models.
Today we release LLaMA, 4 foundation models ranging from 7B to 65B parameters.
LLaMA-13B outperforms OPT and GPT-3 175B on most benchmarks. LLaMA-65B is competitive with Chinchilla 70B and PaLM 540B.
The weights for all models are open and available at
1/n
We obtain new state-of-the-art results on TriviaQA (+4.5%) and NaturalQuestions (+2.3%). We also used this technique for our winning entry to the 6 GB track of the Efficient QA competition (more on this soon).
Paper:
Hyper-parameter autotuning for fastText: get a 1MB text classifier while having a coffee.
With this new feature, it is possible to constrain the size of the final model, and automatically find the hyper-parameters giving the best results on a validation set.
Facebook AI researchers are releasing a new feature for the fastText library which provides hyper-parameter autotuning for more efficient text classifiers.
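With the fastText Python bindings, the autotuning is a single call; file names below are placeholders:

```python
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    autotuneValidationFile="valid.txt",   # hyper-parameters are selected on this set
    autotuneModelSize="1M",               # constrain the final (quantized) model to ~1MB
    autotuneDuration=600,                 # search budget in seconds (about one coffee)
)
model.save_model("model.ftz")
```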
New release of our Contriever project! It includes multi-lingual models which can perform cross-lingual retrieval (e.g., retrieve English documents to answer a question in Swahili), the code to (pre-)train your own retrievers, and an updated version of the paper with new results.
Code for Contriever is now available!
Code:
Paper:
Additionally we trained mContriever, a state-of-the-art multilingual neural retriever, by applying a similar contrastive learning method.
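The contrastive pre-training boils down to an InfoNCE loss with in-batch negatives; a hedged sketch (the temperature and encoders are placeholders, not the released code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim); row i of each forms a positive pair,
    every other row in the batch serves as a negative."""
    scores = query_emb @ passage_emb.T / temperature      # (batch, batch) similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)                # diagonal entries are the positives
```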
Introducing PEER, a new language model which makes text generation and editing more collaborative and controllable. It adds a human in the loop, by following instructions and providing explanations.
Work led by
@timo_schick
.
Paper:
🎉 New paper 🎉 We introduce PEER, a language model trained to incrementally write texts & collaborate w/ humans in a more natural way. It can write drafts, add suggestions, follow instructions, perform edits, correct itself & provide explanations.
Link:
@OriolVinyalsML
Interesting, didn't know about this appendix since it was removed in v4 and v5. Also, the all-attention layer is different since it merges the two sublayers, hence using the same attention over parameters and hidden states.
@abacaj
Yes, we have a couple of papers on that exact topic with
@gizacard
and
@PSH_Lewis
. Combining these advances led to the Atlas language model (paper: , code: ).
@arankomatsuzaki
Nice! By the way, we also have a new way to train retriever systems, by distilling the attention scores of the reader to the retriever:
Our main finding: generative models are great at combining information from multiple passages, as their performance keeps improving as the number of support documents increases. 2/3
Our model, with 11B parameters and significantly less training compute, outperforms LLMs on 64-shot question answering (+3 pts wrt SOTA) or 15-shot fact checking (+5 pts wrt SOTA).
By processing passages independently in the encoder, but jointly in the decoder, our models scale to large numbers of passages, and can combine information from these multiple passages. 3/3
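The architecture in one hedged sketch: encode each (question, passage) pair on its own, then let the decoder attend jointly over the concatenation of all encoder states. Here `encoder` and `decoder` stand for the two halves of any encoder-decoder model; this is not the released code:

```python
import torch

def fusion_in_decoder(encoder, decoder,
                      question_passage_ids: torch.Tensor,
                      decoder_input_ids: torch.Tensor) -> torch.Tensor:
    """question_passage_ids: (n_passages, passage_len) token ids of question+passage pairs."""
    # 1) Encode every passage independently: cost grows linearly with the number of passages.
    encoded = encoder(question_passage_ids)              # (n_passages, passage_len, dim)
    # 2) Concatenate all encoder states into one long sequence.
    fused = encoded.reshape(1, -1, encoded.size(-1))     # (1, n_passages * passage_len, dim)
    # 3) The decoder attends over all passages at once while generating the answer.
    return decoder(decoder_input_ids, encoder_hidden_states=fused)
```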
Previous works showed that retrieval is helpful for knowledge intensive tasks, but mostly in settings with large training sets. Here, we show how to get the same benefits for few-shot learning.
@mark_riedl
Maybe use membership inference? Give a different example (or set of examples) to each student, and check whether these examples were used to train the models?
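Concretely, something like the following, assuming the suspect model exposes a per-token loss (all names are hypothetical):

```python
import numpy as np

def likely_trained_on(model_loss_fn, canary: str, controls: list,
                      margin: float = 1.0) -> bool:
    """model_loss_fn(text) -> average negative log-likelihood under the suspect model.
    `canary` was given to one student; `controls` were never shown to anyone."""
    canary_loss = model_loss_fn(canary)
    control_losses = np.array([model_loss_fn(c) for c in controls])
    # A canary loss far below the control distribution suggests it was in the training data.
    return canary_loss < control_losses.mean() - margin * control_losses.std()
```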
@yoavgo
@_joaogui1
@BlancheMinerva
@MetaAI
@GuillaumeLample
No, it just means that after a number of tokens, it's more efficient to increase the model size than the dataset size to improve performance. Personally, I would not use the word "overtrained" to describe these models...
@yoavgo
@kroscoo
@ryandcotterell
I believe that with large neural nets, it would not really improve over a canonical segmentation. In the paper though, we did not have a subword vocabulary or canonical segmentation, and used all char ngrams (with freq. threshold).
@milesosborne
@aCraigPfeifer
That's a great and fair question! I think that a big advantage of (unsupervised) dense retrievers is that they easily benefit from a few annotated queries.
@yoavgo
@kroscoo
@ryandcotterell
This could potentially be used to find (slightly) better segmentations? But overall, yeah, having high-capacity models means these differences don't really matter :/
@mo_norouzi
@zodiacJRH
@haffari
Awesome work! We had a similar model in , although we did not apply it to segmentation for NMT. Excited to see advances in this research direction.
@yoavgo
@kroscoo
@ryandcotterell
It is also possible to train by marginalizing over segmentations, if you parametrize the conditional distribution using characters.
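For concreteness, a toy dynamic program that sums over all segmentations (up to a maximum segment length), where `segment_logprob` is a placeholder for a character-level model scoring log p(segment | prefix):

```python
import numpy as np

def marginal_logprob(x: str, segment_logprob, max_len: int = 10) -> float:
    """log p(x), marginalized over all segmentations with segments of length <= max_len."""
    n = len(x)
    alpha = np.full(n + 1, -np.inf)   # alpha[j] = log-prob of the prefix x[:j]
    alpha[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            alpha[j] = np.logaddexp(alpha[j], alpha[i] + segment_logprob(x[i:j], x[:i]))
    return float(alpha[n])
```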