An AI model built by the community, for everyone in the world
Part of the Linux Foundation, Apache 2 licensed
An RNN scaled to 14B params with GPT-level performance
#RWKV
is One Dev's Journey to Dethrone Transformers
The largest RNN ever (up to 14B). Parallelizable. Fast inference & training. Quantizable. Low VRAM usage.
3+ years of hard work
Created by
@BlinkDL_AI
Computation sponsored by
@StabilityAI
@AiEleuther
Introducing Eagle-7B
Based on the RWKV-v5 architecture, bringing into the open-source space the strongest:
- multi-lingual model
(beating even Mistral)
- attention-free transformer today
(10-100x+ lower inference cost; see the toy sketch below)
With English performance comparable to the best 7B models trained on ~1T tokens
All while being
- Cleanly licensed Apache 2, under
@linuxfoundation
(do anything with it!)
- The world's greenest 7B model
(by per-token energy consumption)
You can find out more from our full writeup:
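For context on the "10-100x+ lower inference cost" point above: RWKV replaces softmax attention with a recurrence over a fixed-size state, so per-token compute and memory stay constant instead of growing with context length the way a KV cache does. Here is a deliberately simplified Python sketch of that idea (a generic linear-attention-style recurrence, not the exact RWKV-v5 time-mixing formula; token shift, learned per-channel decays, and normalization are omitted, and all sizes are made up for illustration):

```python
# Toy sketch: why an attention-free recurrence keeps per-token inference
# cost constant, while softmax attention's KV cache grows with sequence length.
# Illustrative only; NOT the exact RWKV-v5 formula.
import numpy as np

d = 64        # head dimension (made-up size for the demo)
decay = 0.9   # scalar decay; RWKV uses learned per-channel decays

def recurrent_step(state, r_t, k_t, v_t):
    """One token of a linear-attention-style recurrence.

    state:         (d, d) running sum of decayed key/value outer products
    r_t, k_t, v_t: (d,) receptance / key / value vectors for this token
    Per-token cost is O(d^2) no matter how long the sequence gets.
    """
    state = decay * state + np.outer(k_t, v_t)  # update the fixed-size state
    out_t = r_t @ state                         # read out with the receptance
    return state, out_t

rng = np.random.default_rng(0)
state = np.zeros((d, d))
for _ in range(1000):                           # 1000 tokens, constant memory
    r, k, v = rng.standard_normal((3, d))
    state, out = recurrent_step(state, r, k, v)
print(out.shape)                                # (64,)
```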
Eagle & Finch
The RWKV v5 and v6 architecture paper is here
Both improve over RWKV-4, scaled up to 7.5B- and 3.1B-parameter multilingual models respectively
Open-source code, weights, and dataset
Apache 2 licensed, under Linux Foundation
The conclusive EagleX is here
Based on the RWKV-v5 architecture, bringing into the open-source 7B space the best SOTA:
- Multi-lingual model
- English perplexity model
- Attention-free transformer today
(10-100x+ lower inference cost)
With comparable English performance to Mistral
If you want to give it a quick try, you can head to our official Hugging Face demo of our latest model here:
We would strongly encourage you to try in non-English languages!
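If you would rather run it locally than use the demo, here is a minimal sketch using the community `rwkv` pip package. The checkpoint path is a placeholder (download the weights from our Hugging Face repo first), and the strategy string should be adjusted to your hardware:

```python
# Minimal local-inference sketch with the `rwkv` pip package (pip install rwkv).
# The checkpoint path is a placeholder; point it at a downloaded RWKV-v5
# (Eagle / EagleX) .pth file.
import os
os.environ["RWKV_JIT_ON"] = "1"    # enable the JIT kernels
os.environ["RWKV_CUDA_ON"] = "0"   # set to "1" to compile the CUDA kernel

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

model = RWKV(model="/path/to/eaglex-7b.pth", strategy="cuda fp16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")  # world (multi-lingual) tokenizer

args = PIPELINE_ARGS(temperature=1.0, top_p=0.7, top_k=100,
                     alpha_frequency=0.25, alpha_presence=0.25)

prompt = "User: Write a short greeting in French.\n\nAssistant:"
print(pipeline.generate(prompt, token_count=100, args=args))
```

On a CPU-only machine a strategy like "cpu fp32" also works, though it will be slow for a 7B model.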
In terms of actual multi-lingual eval numbers, we see a substantial overall jump (by 4%!) from our previous RWKV-v4-based architecture, even with the same training dataset.
A huge win for 50% of the world's population
(going beyond the ~17% of the world that speaks English)
This is significant, because it shows clear evidence that RWKV / linear transformers...
Have strong potential to replace existing attention-based architectures, with substantially lower inference cost and no feature compromise
So all we need to do next is get GPUs & scale
RWKV v5 3B model (preview) is out
The final fine-tune, to increase its context length to 8k, is on its way, which will hopefully also give that final score bump
For now it looks on track to match the top 3B models in English, and surpass them all in multi-lingual benchmarks
Regardless, we plan to further train this model with another 1T tokens, to bring it into direct comparison with the LLaMA 2 7B model, and hopefully surpass it
Because it seems like we are scaling like a transformer by token count, as seen by our similar scores against Pythia at the 300B-token mark
Meanwhile, English-based evals show a similar leap, bringing us in line with the token scaling laws of transformers
Where we trade blows with other models trained on a similar (or higher) token count
Before losing out to much longer-trained models like Mistral
All while being
- Cleanly licensed Apache 2, under
@linuxfoundation
(do anything with it!)
- The world's greenest 7B model
(by per-token energy consumption)
- Trained on 2.25T tokens
You can find out more from our full writeup here:
Stay tuned for more details on our upcoming models this week
- Eagle: 7B params, 2.25T tokens
- Finch: 1.6B params, 2.5T tokens
(Some of you probably already know where to find it, if you search through our repos / discord)
This also marks the final Eagle model in our v5 line.
Future Finch models will be based on the v6 architecture, which has shown roughly 10% (give or take) improvement in performance over v5
While being upcycling-compatible with v5 (see the sketch below)
So here comes the Finch
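"Upcycling compatible" means a v6 model can be initialized from a v5 checkpoint rather than from scratch. Below is a hedged sketch of that idea in generic PyTorch terms, not our actual conversion script (`V6Model` and the paths are placeholders): copy every tensor whose name and shape still match, and leave the new v6-only parameters at their fresh initialization.

```python
# Illustrative "upcycling" sketch in plain PyTorch: initialize a newer
# architecture from an older checkpoint by copying the parameters that
# still match. Generic pattern, not the exact RWKV v5 -> v6 conversion.
import torch

def upcycle(new_model: torch.nn.Module, old_ckpt_path: str) -> torch.nn.Module:
    old_state = torch.load(old_ckpt_path, map_location="cpu")
    new_state = new_model.state_dict()

    copied = dropped = 0
    for name, tensor in old_state.items():
        # Copy only tensors whose name and shape are unchanged in the new arch.
        if name in new_state and new_state[name].shape == tensor.shape:
            new_state[name] = tensor
            copied += 1
        else:
            dropped += 1  # renamed/re-shaped; the new counterpart keeps fresh init

    new_model.load_state_dict(new_state)
    print(f"copied {copied} tensors from the old checkpoint, dropped {dropped}")
    return new_model

# usage (placeholders): model = upcycle(V6Model(config), "rwkv-v5-finch.pth")
```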
The RWKV community wiki can be found at:
Our discord can be found at:
Give the model a try, drop by our Discord, and give us feedback on how we can improve the model for the community.
Does this cover our latest model?
No, this covers our previously released Eagle and Finch lines of models, trained on up to 1.1T tokens
A reminder that, as a fully open-source project, we release in the following sequence: code, weights, then the paper
Not the other way around
Why is this progress significant?
Because it shows clear evidence that RWKV / linear transformers...
Have the potential to replace existing attention-based architectures, with substantially lower inference cost and no feature compromise
Paper at:
Wrapping up:
#RWKV
was originally created by
@BlinkDL_AI
as a project at
@AiEleuther
, and is now hosted by
@LFAIDataFdn
Compute for this training was sponsored by
@recursal_AI
You can find the latest EagleX model on their cloud platform here:
As with the previous 7B model, we push the open-source SOTA landscape further, with leading English perplexity performance.
While maintaining SOTA multi-lingual performance across 23 languages
This is in line with our OSS group's overall goal:
To ensure the best AI models are made accessible to everyone worldwide, regardless of language or economic status
(approximate map of languages supported worldwide)
All while surpassing LLaMA 2 7B across a mixture of 21 popular English evals.
While closing the gap with Mistral 7B.
Proving that, with continued training, the model architecture scales similarly to (or better than) transformers by token count.
Special shout-outs to
@BlinkDL_AI
: the creator of RWKV
@AiEleuther
: awesome folks who helped us in the paper-authoring process
@LFAIDataFdn
: for hosting the OSS project
@StabilityAI
: for sponsoring a large part of the GPU compute used for these documented models
If you want to give it a quick try, you can head to our official Hugging Face demo of our latest model here:
We would strongly encourage you to try in non-English languages!
@QuentinAnthon15
@BingchenZhao
In addition, shout out to the various contributors to the dataset, the model architecture, and the training & inference code
Paper authorship reflects paper-writing contribution, which is separate from model creation / code / dataset contribution
A tiny
#RWKV
with 2.9M (!) params can solve 18239.715*9.728263 or 4.2379*564.778-1209.01 etc. with CoT, while being 100%
#RNN
(L6-D192). The trick: generate lots of data with reversed numbers (denoted by "f" here) to train the model. Try it now:
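To make the reversed-number trick concrete, here is a hedged sketch of how such synthetic training data could be generated. The exact prompt format the 2.9M-param model was trained on may differ; the point is that writing numbers with their digits reversed (marked with "f") lets a left-to-right model emit the least significant digits first, which is the order in which long multiplication can actually be carried out:

```python
# Illustrative generator for reversed-number arithmetic data (not necessarily
# the exact format used for the 2.9M-param demo model). "f" marks a number
# written with its digits reversed, so the model can produce the least
# significant digits first, matching how long multiplication proceeds.
import random

def reverse_number(x: str) -> str:
    """Reverse the digit order of a decimal string: '123.45' -> '54.321'."""
    return x[::-1]

def make_example(rng: random.Random) -> str:
    a = round(rng.uniform(1, 20000), 3)
    b = round(rng.uniform(1, 1000), 6)
    question = f"{a}*{b}"
    answer = f"{a * b:.6f}".rstrip("0").rstrip(".")
    # Chain-of-thought target: restate both operands reversed ("f ..."),
    # then give the reversed answer before the final normal-order answer.
    cot = (f"f {reverse_number(str(a))} * f {reverse_number(str(b))} = "
           f"f {reverse_number(answer)} = {answer}")
    return f"Q: {question}\nA: {cot}\n"

rng = random.Random(42)
with open("reversed_mult_data.txt", "w") as out:
    for _ in range(100_000):  # lots of synthetic examples
        out.write(make_example(rng))
```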