Next, we study how power-law scaling decomposes for LLMs, using the "Pythia" suite of models from @AiEleuther. Instead of studying only how the mean test loss (on The Pile) falls off with scale, we show how the full distribution of per-token losses scales:
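The per-token losses underlying such a distribution can be computed directly from a causal LM's logits. Here is a minimal sketch, assuming PyTorch and the Hugging Face `transformers` library; the checkpoint name `EleutherAI/pythia-70m` is one real member of the Pythia suite, but the helper function and example text are illustrative, not the analysis pipeline used here:

```python
import torch
import torch.nn.functional as F


def per_token_nll(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token negative log-likelihoods for a causal LM.

    logits: (seq_len, vocab) model outputs; input_ids: (seq_len,) token ids.
    Token t is predicted from tokens < t, so position 0 has no loss and the
    result has shape (seq_len - 1,).
    """
    return F.cross_entropy(logits[:-1], input_ids[1:], reduction="none")


if __name__ == "__main__":
    # Illustrative usage with one Pythia checkpoint (downloads weights).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/pythia-70m"  # smallest model in the suite
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    ids = tok("The quick brown fox jumps over the lazy dog.",
              return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]

    nll = per_token_nll(logits, ids)
    # The mean recovers the usual scalar test loss; the vector itself is a
    # sample from the per-token loss distribution studied above.
    print(nll.mean().item())
```

Aggregating these per-token vectors over a held-out corpus, for each model size in the suite, yields the loss distributions whose scaling we examine.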