@maximelabonne
Maxime Labonne
3 months
🔍 What Starling-LM-7B-beta's excellent performance tells us about benchmarks
I compared the performance of @NexusflowX's model across various benchmarks. In the Chatbot Arena Leaderboard, this 7B model impressively outperforms many larger models,…
[Image: benchmark comparison chart]
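A minimal sketch of how this kind of cross-benchmark comparison can be set up: collect each model's score per benchmark, convert scores to ranks, and compare mean ranks. The model names and numbers below are placeholders for illustration, not the figures from the attached chart.

```python
# Sketch: compare models by mean rank across benchmarks.
# Scores below are placeholders, not actual results.
from collections import defaultdict

scores = {
    "Chatbot Arena": {"model_a": 1100, "model_b": 1080, "model_c": 1120},
    "MT-Bench":      {"model_a": 7.8,  "model_b": 8.1,  "model_c": 7.5},
    "IFEval":        {"model_a": 0.44, "model_b": 0.54, "model_c": 0.50},
}

ranks = defaultdict(list)
for benchmark, results in scores.items():
    # Rank 1 = best score on this benchmark (higher is better here).
    ordered = sorted(results, key=results.get, reverse=True)
    for rank, model in enumerate(ordered, start=1):
        ranks[model].append(rank)

for model, r in sorted(ranks.items(), key=lambda kv: sum(kv[1]) / len(kv[1])):
    print(f"{model}: mean rank {sum(r) / len(r):.2f} across {len(r)} benchmarks")
```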

Replies

@_lewtun
Lewis Tunstall
3 months
@maximelabonne @NexusflowX Wonderful analysis! Another interesting point of comparison is IFEval (a measure of instruction following), where Starling scores 44% while Mistral/Mixtral Instruct score ~54%. To me, this suggests the arena has a mild bias against “uncensored” models that follow user intent to the letter.
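For context, IFEval scores a model on prompts that carry programmatically verifiable instructions, with "strict accuracy" being the share of responses that satisfy all of them. The snippet below is a rough illustrative sketch of that kind of check, not the actual IFEval implementation; the constraints and sample responses are invented.

```python
# Rough sketch of an IFEval-style check: each prompt carries verifiable
# constraints, and strict accuracy is the share of responses satisfying
# all of them. Constraints and responses here are made up for illustration.
import re

def follows(response: str, constraints: dict) -> bool:
    if "max_words" in constraints and len(response.split()) > constraints["max_words"]:
        return False
    if constraints.get("no_commas") and "," in response:
        return False
    if "min_bullets" in constraints:
        bullets = re.findall(r"^[-*] ", response, flags=re.MULTILINE)
        if len(bullets) < constraints["min_bullets"]:
            return False
    return True

samples = [
    ("List three fruits as bullets.", "- apple\n- pear\n- plum", {"min_bullets": 3}),
    ("Answer in under 5 words, no commas.", "Yes, absolutely, I agree.", {"max_words": 5, "no_commas": True}),
]

passed = sum(follows(resp, cons) for _, resp, cons in samples)
print(f"strict accuracy: {passed / len(samples):.0%}")
```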
@maximelabonne
Maxime Labonne
3 months
@_lewtun @NexusflowX Thanks, that's an excellent point!
@sam_paech
Sam Paech
3 months
@maximelabonne @NexusflowX I had the opposite interpretation of this result: that Arena Elo is failing to capture the holistic capabilities of the model, and/or they figured out how to exploit human preference.
@maximelabonne
Maxime Labonne
3 months
@sam_paech @NexusflowX Yes, it's completely possible, but I'm cautious about "exploiting human preferences" (like verbosity). In the end, that's what people want, so it's difficult to say it's truly exploiting anything (although I agree it's kind of hacking it).
@vackosar
Vaclav Kosar
3 months
@maximelabonne @NexusflowX I agree that despite Starling-LM-7B-beta's good benchmark results, it is not that good a model for real chat. Nous-Hermes-2-Mistral-7B-DPO, or perhaps the Pro version, is better. By the way, are you planning to add these Hermes models to your benchmark?
@maximelabonne
Maxime Labonne
3 months
@vackosar @NexusflowX That's a good idea; I just started evaluating the Pro version.
@Michael_AI_bro
Michael Spencer
3 months
@maximelabonne @NexusflowX This week has been crazy. Where's the newsletter just about open source AI news?
@gblazex
Blaze (Balázs Galambosi)
3 months
@maximelabonne @NexusflowX This is great. We looked at length bias in the Arena when working on Alpaca, and it's actually not that strong; I think it was around 0.15 Spearman. (That's why Alpaca had to be length-controlled: in that benchmark the bias was way more pronounced than in the Arena leaderboard.)
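A rough sketch of how such a length-bias estimate could be computed, assuming pairwise battle records with response lengths and a win label (the records below are invented; real data would come from Arena logs):

```python
# Sketch: Spearman correlation between "response A is longer" and "A wins",
# as a rough proxy for length bias in pairwise preference data.
# The battle records below are made up for illustration.
from scipy.stats import spearmanr

battles = [
    # (len_a, len_b, a_won)
    (820, 410, 1),
    (300, 950, 0),
    (640, 620, 1),
    (200, 180, 0),
    (1100, 400, 1),
    (350, 700, 1),
]

length_diff = [a - b for a, b, _ in battles]
a_won = [w for _, _, w in battles]

rho, pval = spearmanr(length_diff, a_won)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2f})")
```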
@user_ops
Hugo Pinheiro
3 months
@maximelabonne @NexusflowX I tried it and was not impressed. I'll play with it a bit more, but for now my go-to general local model is Hermes 2 Pro; I'm really impressed by it.
@and_or_y
andy
3 months
@maximelabonne @NexusflowX Nowadays, it seems all Mistral-class 7B models can hold a valid conversation, and the chat arena simply captures human preferences for output style: e.g., bullet points raise the score, numbered lists raise the score, etc. What's more interesting to me is instruction following.
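One hypothetical way to probe that claim is to extract simple style features from responses and check how much they explain win rates; the regex heuristics below are illustrative only, not a validated style taxonomy.

```python
# Sketch: extract simple style features (bullets, numbered lists, headers,
# length) from a response, to test how much formatting explains preference wins.
import re

def style_features(text: str) -> dict:
    return {
        "n_bullets":  len(re.findall(r"^[-*] ", text, flags=re.MULTILINE)),
        "n_numbered": len(re.findall(r"^\d+\. ", text, flags=re.MULTILINE)),
        "n_headers":  len(re.findall(r"^#{1,6} ", text, flags=re.MULTILINE)),
        "n_words":    len(text.split()),
    }

example = "## Plan\n1. Collect battles\n2. Fit a logistic model\n- report coefficients"
print(style_features(example))
# -> {'n_bullets': 1, 'n_numbered': 2, 'n_headers': 1, 'n_words': 13}
```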