Banghua Zhu

@BanghuaZ

1,679 Followers · 820 Following · 11 Media · 220 Statuses

PhD @Berkeley_EECS , statistics, info theory, LLM, RL, Human-AI Interactions.

Berkeley, CA
Joined August 2018
Pinned Tweet
@BanghuaZ
Banghua Zhu
2 months
🚀 Presenting Starling-LM-7B-beta, our cutting-edge 7B language model fine-tuned with RLHF! 🌟 Also introducing Starling-RM-34B, a Yi-34B-based reward model trained on our Nectar dataset, surpassing our previous 7B RM in all benchmarks. ✨ We've fine-tuned the latest Openchat
13
52
282
@BanghuaZ
Banghua Zhu
6 months
🚀Introducing new (synthetic) RLHF Dataset Nectar and new open model Starling-LM-7B-alpha🚀 🌟 Model & Dataset Highlights: 📊 Scores 8.09 in MT Bench: Surpassing all existing models except OpenAI's GPT-4 and GPT-4 Turbo. 📚 183K Chat Prompts + 7 responses in Nectar: With 3.8M
Tweet media one
20
135
697
@BanghuaZ
Banghua Zhu
6 months
Excited to see starling-7B-alpha is (slightly) more preferred than other 7B models! Actually I expected the other way around. Attaching my favorite example below. Starling-alpha is for sure slightly over-RLHFed to maximize GPT-4 preference rather than human preference and can be
Tweet media one
@lmsysorg
lmsys.org
6 months
Exciting Arena Leaderboard Updates! Six new models: - Tulu-2-DPO-70B and Yi-34B-Chat are the new SoTA open models - Mistral-based 7B models (OpenChat, OpenHermes-2.5, Starling-7B) are stronger than ever Big congrats to the OSS AI community! Learn more
Tweet media one
Tweet media two
12
75
344
5
69
76
@BanghuaZ
Banghua Zhu
11 months
Fine-Tuning LMs with Advantage-Induced Policy Alignment We propose a new RL algorithm, APA, that improves over PPO with better KL control and performance, and benchmark PPO, AWR, APA in offline and online RLHF. arxiv: HF page:
Tweet media one
5
18
78
@BanghuaZ
Banghua Zhu
6 months
I'll be at #NeurIPS2023 , and the academic job market this year! RT will be greatly appreciated! I work on statistics and information theory, with applications in robust statistics, offline RL, game theory, human-AI interactions and LLMs. I'm recently working on better
Tweet media one
Tweet media two
0
19
68
@BanghuaZ
Banghua Zhu
4 months
Not sure if that's a fair comparison when Bard is using a search API while GPT-4 and other models are not (example below). The bare-metal Gemini Pro API seems to sit between Mixtral 8x7B and GPT-3.5. So is the key difference search, which greatly improves human preference?
Tweet media one
@lmsysorg
lmsys.org
4 months
🔥Breaking News from Arena Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement! The race is heating up like never before! Super excited to see what's next for Bard + Gemini
Tweet media one
155
630
3K
6
10
60
@BanghuaZ
Banghua Zhu
5 months
This seems to be very related to conditional SFT, where you feed in data of mixed quality with different "hidden signals" (like adding a GPT-4 or GPT-3.5 tag in the chat template) during SFT, and at inference time you only use the chat template of the highest quality. It's
@abacaj
anton
5 months
Telling mixtral that it is "ChatGPT developed by OpenAI" boosts humaneval score by 6%
Tweet media one
Tweet media two
162
278
4K
1
9
54
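A minimal sketch of the conditional-SFT idea described in the post above: tag each training example with a "hidden signal" indicating the source/quality of the response, then always condition on the highest-quality tag at inference time. The tag strings and template below are illustrative assumptions, not a specific recipe.

```python
# Illustrative conditional SFT with quality tags ("hidden signals").
# Tag names and template format are assumptions for this sketch only.
QUALITY_TAGS = {"gpt-4": "<|quality:high|>", "gpt-3.5": "<|quality:mid|>"}

def format_sft_example(prompt: str, response: str, source_model: str) -> str:
    """Training time: prepend a tag recording which model wrote the response."""
    return f"{QUALITY_TAGS[source_model]} User: {prompt}\nAssistant: {response}"

def format_inference_prompt(prompt: str) -> str:
    """Inference time: always condition on the highest-quality tag."""
    return f"{QUALITY_TAGS['gpt-4']} User: {prompt}\nAssistant:"
```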
@BanghuaZ
Banghua Zhu
6 months
Thrilled to share one of the most exciting projects I've been involved in this year. Imagine a 13B model that may run locally on your device, using tools and calling nested / parallel functions more effectively than GPT-4. What excites me most isn't just the model itself, but
@NexusflowX
Nexusflow
6 months
🚀Calling all developers of copilots and AI agents! Introducing 🐦‍⬛NexusRaven V2, a 13B function calling LLM surpassing GPT-4 in real-world zero-shot tool use. ✨ Highlights of 🐦‍⬛NexusRaven V2: 💪Superior Performance: NexusRaven V2 surpasses GPT-4 up to 7% on complex nested and
Tweet media one
13
98
395
1
4
50
@BanghuaZ
Banghua Zhu
2 months
Very interesting and detailed analysis of Starling benchmark results. Usually RLHF won't change the model capability too much. But the style of the responses will look more helpful and less harmful; that's probably why it ranks higher on human evaluation. The capability of the
@maximelabonne
Maxime Labonne
2 months
🔍 What Starling-LM-7B-beta's excellent performance tells us about benchmarks I compared the performance of @NexusflowX 's model across various benchmarks. In the Chatbot Arena Leaderboard (), this 7B model impressively outperforms many larger models,
Tweet media one
8
26
187
3
4
43
@BanghuaZ
Banghua Zhu
10 months
Sharing our RLHF work! #ICML2023 We analyze reward learning in RLHF: 1. There's an asymptotically more efficient K-wise alternative to the original algo in the InstructGPT paper (ChatGPT) 2. MLE converges for parameter estimation, but requires pessimism to converge for policy learning (a sketch of the K-wise objective is below)
Tweet media one
5
5
36
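For reference, a sketch of the K-wise reward-learning objective the post refers to, assuming a Plackett-Luce model over a ranking y_{σ(1)} ≻ … ≻ y_{σ(K)} of K responses to a prompt x (the notation here is mine, not necessarily the paper's):

```latex
% K-wise MLE under a Plackett-Luce model (illustrative notation)
\hat{\theta} = \arg\max_{\theta} \sum_{(x,\sigma)} \sum_{i=1}^{K-1}
  \log \frac{\exp\!\big(r_\theta(x, y_{\sigma(i)})\big)}
            {\sum_{j=i}^{K} \exp\!\big(r_\theta(x, y_{\sigma(j)})\big)}
```

For K = 2 this reduces to the pairwise Bradley-Terry MLE used in InstructGPT.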
@BanghuaZ
Banghua Zhu
1 month
Very excited about the release of Arena-Hard, the main benchmark we looked at when selecting checkpoints for the Starling model. It focuses on a subset of very hard prompts from Chatbot Arena.
@lmsysorg
lmsys.org
1 month
Introducing Arena-Hard – a pipeline to build our next generation benchmarks with live Arena data. Highlights: - Significantly better separability than MT-bench (22.6% -> 87.4%) - Highest agreement to Chatbot Arena ranking (89.1%) - Fast & cheap to run ($25) - Frequent update
Tweet media one
20
124
643
1
3
34
@BanghuaZ
Banghua Zhu
1 year
(1/n) LLM inference can be costly due to the large model size and auto-regressive nature. In practice, can we find the best way to cache existing queries and choose the most appropriate model for inference? In , we initiate a study towards that.
Tweet media one
1
5
26
@BanghuaZ
Banghua Zhu
3 months
Very interesting paper that studies the effect of RLHF with Starling reward model! Excited to see more open research in this space.
@michaelryan207
Michael Ryan
3 months
Aligned LLMs should be helpful, harmless, and adopt user preferences. But whose preferences are we aligning to and what are unintended effects on global representation? We find SFT and Preference Tuning steer LLMs towards US English use and opinions. 🧵
Tweet media one
5
53
209
1
6
25
@BanghuaZ
Banghua Zhu
1 month
Check out the ICML workshop on Theoretical Foundations of Foundation Models!
@tf2m_workshop
Theoretical Foundations of Foundation Models
1 month
We are happy to announce that the Workshop on Theoretical Foundations of Foundation Models will take place @icmlconf in Vienna! For details: Organizers: @BerivanISIK , @SZiteng , @BanghuaZ , @eaboix , @nmervegurel , @uiuc_aisecure , @abeirami , @sanmikoyejo
1
11
49
0
3
23
@BanghuaZ
Banghua Zhu
2 months
DBRX is an amazing masterpiece! If you're looking for smaller models for your use cases, plz give Starling-7B a try, which seems not too bad according to chatbot arena!
@NexusflowX
Nexusflow
2 months
Have we really squeezed out the capacity of a compact chat model? Thrilled to see our latest open model, Starling-7B, ranks 13th among all models in Chatbot Arena! 🚀 As a 7B model, Starling surpasses larger open and proprietary models, including Claude-2, GPT-3.5-Turbo, Gemini
Tweet media one
4
20
105
1
3
21
@BanghuaZ
Banghua Zhu
6 months
Thank you so much @ClementDelangue !! More exciting news coming soon!
@ClementDelangue
clem 🤗
6 months
So cool to see the #1 trending dataset released by academia ( @UCBerkeley ) & #2 by a non-profit ( @wikimedia ). IMO academia & non-profits have the opportunity in the US to fill the void left by big tech companies on open science and open-source AI.
Tweet media one
2
26
178
1
2
21
@BanghuaZ
Banghua Zhu
6 months
@srush_nlp Personally I don't believe the reason for PPO is complicated math / regularization. I think the key difference is still offline RL vs online RL. In theory, offline RL can bring you to the best covered policy. If the response in preference contains high quality GPT data, offline
6
1
20
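To unpack "best covered policy": one standard (hedged) way to state the offline-RL guarantee is through a coverage / concentrability coefficient, i.e. offline RL can only be expected to compete with comparator policies π whose visitation distribution is covered by the data distribution μ:

```latex
% Coverage condition for a comparator policy \pi w.r.t. the data distribution \mu
C_\pi \;=\; \sup_{s,a} \frac{d^\pi(s,a)}{\mu(s,a)} \;<\; \infty
```

So if the preference data already contains high-quality (e.g. GPT-4) responses, the set of covered policies includes very strong ones.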
@BanghuaZ
Banghua Zhu
2 months
Huge congrats to the amazing folks at lmsys! Vicuna and chatbot arena are really important milestones in the field of open source and LLMs!
@lmsysorg
lmsys.org
2 months
One year ago was Vicuna's birthday🎂! We were so excited and built a demo for it at chat.lmsys.org. We never imagined it could get this far. Millions of people downloaded our models, visited our demo, and played with our fine-tuning recipe in FastChat project. We then
7
21
197
0
1
17
@BanghuaZ
Banghua Zhu
3 months
Check out our recent position paper on GenAI security, a very interesting field to work on and a lot of open problems there!
@TheNormanMu
Norman Mu
3 months
Securing models against adversarial manipulation is table stakes today for real-world GenAI/LLM deployments. In our new position paper with @BanghuaZ , @JiantaoJ , and David Wagner we outline current challenges and promising directions for future work in GenAI security
Tweet media one
1
8
51
0
6
18
@BanghuaZ
Banghua Zhu
6 months
@_albertgu @tri_dao Wow this is amazing! Curious for 7B comparisons, why would you compare mostly with GPT-J, Pythia rather than newer llama 7B or mistral 7B? Is it because of different tokenization? I saw the scaling law in Figure 4 which seems really promising compared with llama family, but
2
0
17
@BanghuaZ
Banghua Zhu
2 months
It’s very interesting to see DPO models being used as a natural reward model! And excited to see Starling-RM-34B on the top of RewardBench!
@natolambert
Nathan Lambert
2 months
Excited to share something that we've needed since the early open RLHF days: RewardBench, the first benchmark for reward models. 1. We evaluated 30+ of the currently available RMs (w/ DPO too). 2. We created new datasets covering chat, safety, code, math, etc. We learned a lot.
Tweet media one
Tweet media two
Tweet media three
113
188
502
0
3
16
@BanghuaZ
Banghua Zhu
6 months
Forgot to add this, but huge kudos to the whole team: Evan Frick, @WthThao (two co-first authors), @zhuhl98 and Jiantao Jiao. Also huge thanks to the open source communities for their great work: @lmsysorg , @huggingface , @AIatMeta , @MistralAI , @alignment_lab , @AnthropicAI ,
0
0
14
@BanghuaZ
Banghua Zhu
7 months
Excited to introduce Pairwise PPO (P3O), a bit of surgery on PPO that makes it invariant to constant shifts in reward and outperforms both PPO and (an online version of) DPO in terms of the KL-reward tradeoff in RLHF. The key intuition is that the reward model in RLHF is trained to be invariant
@WthThao
TianhaoWu
7 months
🤨 Why not use *comparative* RL to fine-tune your LLMs? 💥 We propose *Pairwise* Proximal Policy Optimization, which performs RL in a comparative manner, surpassing PPO and DPO in LLM alignment Blog: arxiv:
Tweet media one
5
13
81
0
0
13
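A one-line illustration (in my notation) of the shift-invariance point from the post above: if the learned reward is only identified up to a prompt-dependent constant, pairwise differences cancel that constant, so a comparative update is unaffected by it.

```latex
% If r'(x, y) = r(x, y) + c(x) for an arbitrary prompt-dependent shift c(x), then
r'(x, y_1) - r'(x, y_2) \;=\; r(x, y_1) - r(x, y_2)
% so any update built from reward differences between two responses to the
% same prompt is invariant to the shift.
```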
@BanghuaZ
Banghua Zhu
1 month
Chatbot Arena usually captures the combination of two aspects: basic capability + human preference alignment. In terms of basic capability, it seems still not yet at GPT-4 level on all benchmark metrics. But Llama3 did a really great job on human preference alignment, likely
@lmsysorg
lmsys.org
1 month
Exciting update -- Llama-3 full result is out, now reaching top-5 on the Arena leaderboard🔥 We've got stable enough CIs with over 12K votes. No question now Llama-3 70B is the new king of open model. Its powerful 8B variant has also surpassed many larger-size models. What an
Tweet media one
30
166
1K
0
1
13
@BanghuaZ
Banghua Zhu
3 months
Thanks a lot Nathan! We are also on the way to tune a larger RM using the same recipe, and see how far it can bring us for tuning downstream LMs. Hope the understanding of RLHF can be much deeper with all these open source efforts!
@natolambert
Nathan Lambert
3 months
While I'm here there are two reward models that are better than the others available in my testing (which hasn't included many DPO models yet). 1. Starling 7B (from Llama 2 chat). Similar to UltraRM on chat performance and is better on safety (likes refusals to toxic prompts).
2
9
60
0
2
13
@BanghuaZ
Banghua Zhu
2 months
Excited to see the new startup from @yisongyue dedicated to AI agents with strong planning and reasoning capabilities! Can't wait to see what will be built by the amazing folks there!
@yisongyue
Yisong Yue
2 months
I'm thrilled to be a part of @AsariAILabs . Our goal is to design AI systems that can break down problems, discover new abstractions, reason about their correctness (and what notions of correctness are required), and generally plan at multiple levels of granularity. These
15
15
176
0
1
12
@BanghuaZ
Banghua Zhu
6 months
Got some really insightful questions from @Fluke_Ellington (and also post from @ldjconfirmed ) on potential training data contamination with MT-Bench. I believe a more detailed explanation here would be beneficial: 1. There's a possibility that the
@ldjconfirmed
LDJ
6 months
Important contamination warning for those using Pure-Dove or derivative datasets & models! I personally don't use AI-judged benchmarks like MT-bench, so I don't typically check my datasets for contamination of such. But thanks to @Fluke_Ellington at @MistralAI , we've
Tweet media one
3
10
69
0
0
10
@BanghuaZ
Banghua Zhu
2 months
@rasbt Yes, sorry we delayed that a bit since we are refactoring the code. But hopefully the code and paper will be out soon!
1
0
10
@BanghuaZ
Banghua Zhu
5 months
This is very interesting. I thought Alpaca Eval might be better correlated because it has a larger prompt test set. But from this result it seems that MT Bench is still a better proxy. Shall we change the reference output of Alpaca Eval from text-davinci-003 to gpt-3.5 or 4?
@gblazex
Blaze (Balázs Galambosi)
5 months
Agree. I did a quick correlation check to the Elo ratings and MT-bench seems to be the closest to human evaluation:
MT bench: 0.97
MMLU: 0.88
AGI eval: 0.87
HELM Lite: 0.85
Alpaca: 0.74
Hugging leaderboard: 0.71
OpenCompass (en): 0.56
7
27
236
1
1
6
@BanghuaZ
Banghua Zhu
6 months
Very neat idea that replaces ReAct-style function calls with DAG-style function calls. It would be interesting to see how well GPT-4 works as an LLM planner in more complex scenarios, and whether we can get a better LLM planner with open source models. Maybe the GAIA leaderboard is a good
@sehoonkim418
sehoonkim
6 months
How can we make LLM agents work together efficiently on complex tasks at a large scale? 🚨Introducing LLMCompiler🦙🛠️, a tool that compiles an effective plan for executing multiple tasks in parallel. It helps create scalable LLM applications, identifies tasks for parallel
Tweet media one
17
126
781
0
1
8
@BanghuaZ
Banghua Zhu
6 months
@winglian I think so! We also provide the example code for using reward model here: . This shall require minimal modification to the trl repo. Sorry for being a bit messy here, things will be much clearer once we release the full code.
1
0
7
@BanghuaZ
Banghua Zhu
6 months
@hu_yifei Yes, we also mentioned in the blog that GPT-4 might prefer longer and talky answers, so we observe in Alpaca Eval that the average response length is 1624 with temperature 0 (the temperature that gets the best alpaca eval score), in contrast, llama2 70B chat is 1790, gpt-4 turbo
0
0
7
@BanghuaZ
Banghua Zhu
5 months
@argilla_io @ClementDelangue Would be great if we have a large scale, high quality human preference dataset with responses generated by Mixtral, gpt, Claude and some other open models.
1
0
5
@BanghuaZ
Banghua Zhu
9 months
Very cool work on OSS MoE! Can't wait to see how good the final checkpoint is after 1T tokens.
@XueFz
Fuzhao Xue
9 months
1/ Announcing the development of OpenMoE project! 🚀 Open Mixture-of-Experts Language Models! MoE + UL2 objective + umT5 tokenizer + 50% code data mix. GitHub: Blog:
10
107
532
0
1
6
@BanghuaZ
Banghua Zhu
6 months
@morgymcg Thank you! Yes, it's definitely on our to-do list. Currently we are organizing the codebase and finishing up the paper, will include wandb metrics soon as well!
0
0
6
@BanghuaZ
Banghua Zhu
6 months
@rm_rafailov @natolambert @_lewtun @lvwerra @Teknium1 @teortaxesTex @abacaj @norabelrose @srush_nlp @stanfordnlp @peterjliu Yea I like the idea of running large-scale "Gold RM" experiments and comparing to a PPO-trained policy using the implicit DPO reward. Honestly I find it really hard to compare DPO vs reward + PPO in an absolutely fair fashion since it might really depend on 1. The relative quality of
0
0
4
@BanghuaZ
Banghua Zhu
5 months
@hausman_k Congrats Karol!
0
0
1
@BanghuaZ
Banghua Zhu
4 months
Very interesting and timely topic given how important it is to collect high-quality human preference datasets!
@lilianweng
Lilian Weng
4 months
🗣️I've been thinking about data quality & human factor in the process a lot lately, so write a short post on the topic: More: If you are into the topic, my team is hiring Research Engineer for a new sub-team Human-AI Interaction:
24
103
760
0
0
5
@BanghuaZ
Banghua Zhu
6 months
@srush_nlp We just found a new battlefield in NLP after long fights in game / robotics lol. But always fun to debate more and get more exps there.
1
0
4
@BanghuaZ
Banghua Zhu
1 year
@zdhnarsil The RL part is actually tricky here. From the InstructGPT paper, their prompts are just sampled from a huge dataset so no exploration. Also it's simply a contextual bandit rather than MDP, so no transition. That said, you can use any method that trains a nn to fine tune the model
0
0
4
@BanghuaZ
Banghua Zhu
24 days
Paper submission & reviewer volunteer form are open!
@tf2m_workshop
Theoretical Foundations of Foundation Models
25 days
We welcome submissions that make theoretical contributions to efficiency, responsibility, and principled understanding of foundation models. For more details, check out our call for papers: Deadline: May 22nd. 1/2
1
5
14
0
2
4
@BanghuaZ
Banghua Zhu
6 months
@bensparks_ Yes, the reward model training takes most of the time, which is around 2-3 days on 8 A100 80G GPUs. After we get the reward model, the online RL finetuning for the LLM takes about several hours on 8 A100 80G GPUs (we only unfreeze the last 4 layers for language model tuning so
0
0
4
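A rough sketch of what "only unfreeze the last 4 layers" could look like, assuming a Hugging Face Llama/Mistral-style causal LM whose decoder blocks live in `model.model.layers` (an illustration, not the released training code; the base-model name is just an example):

```python
# Illustrative: freeze everything, then unfreeze only the last 4 decoder blocks
# before RL fine-tuning. Assumes an HF Llama/Mistral-style architecture.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("openchat/openchat_3.5")  # example base model

for param in model.parameters():
    param.requires_grad = False

for block in model.model.layers[-4:]:
    for param in block.parameters():
        param.requires_grad = True
```

Restricting the trainable parameters this way is what helps keep the RL stage much cheaper than full fine-tuning.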
@BanghuaZ
Banghua Zhu
6 months
@_lewtun @StenRuediger @nlpguy_ @GoogleDeepMind Completely agree! It's interesting to note that Openchat 3.5 seems to skip SFT, directly doing C-RLFT (an offline method) on pre-trained models. So I'm curious whether DPO also has the potential to directly replace SFT as well? The ultimate pipeline I have in mind is DPO /
2
0
4
@BanghuaZ
Banghua Zhu
5 months
@thegautamkamath Seems they still want to generate, just at temperature=0
0
0
4
@BanghuaZ
Banghua Zhu
10 months
Just watched an insightful interview with JT & my advisor Mike Jordan on the AI hype. He's excited to see how simple ML methods can achieve great results, but also warns against hyping exaggerated information that could lead to inappropriate regulation.
1
0
4
@BanghuaZ
Banghua Zhu
6 months
@rajammanabrolu We're also experimenting with these. It seems that online RL really shines when you push for last mile performance, while DPO is yet to be validated (on our to-do list), especially when you fine tune on some model that is already very good. Will have more to say about this
0
0
3
@BanghuaZ
Banghua Zhu
5 months
@eating_entropy @airkatakana @Teknium1 @alignment_lab Depends on how you define performance. Usually SFT and DPO shall improve the score on both the OpenLLM leaderboard (capability) and MT Bench / human preference (helpfulness), while RLHF won't affect capability but only improve human preference. If your RLHF dataset doesn't
1
1
3
@BanghuaZ
Banghua Zhu
6 months
@morgymcg @jxnlco Similar to my experience. Turbo is very optimized for the chat experience, beating GPT-4 on Chatbot Arena by @lmsysorg . But other capabilities like instruction following and function calling seem slightly worse. I suspect they got to distill a ~100B non-MoE model and do very good
0
0
3
@BanghuaZ
Banghua Zhu
10 months
1
0
3
@BanghuaZ
Banghua Zhu
10 months
ArXiv Link: Besides the main conf, I'll also present it in the MFPL workshop! Also check out the other two papers at ICML on Human-AI interaction in Stackelberg games () and Jump-Start RL (). Happy to chat more in Hawaii!
Tweet media one
Tweet media two
0
0
3
@BanghuaZ
Banghua Zhu
6 months
@LMStudioAI @HenkPoley Oh thanks for pointing this out! Seems that we didn't include the tokenizer files from openchat 3.5. Now things shall be fixed.
1
0
3
@BanghuaZ
Banghua Zhu
10 months
This is very interesting. Human-in-the-loop might be the right answer for copilot in the end. It also suggests that prompting is not enough for complicated API usage, a thorough fine-tuning (SFT or RL) shall benefit a lot.
@AlexKontorovich
Alex Kontorovich
10 months
Interesting GPT4 experiments by Ernest Davis and Scott Aaronson on arxiv today: This has been my experience as well (p. 15): "It seems likely that GPT4+CI and GPT4+WA are most useful when not relied on as “oracles,” but in “interactive mode”..."
3
30
143
0
0
3
@BanghuaZ
Banghua Zhu
6 months
@OpenAI This year is crazy... What would be the surprising news in the last one month..
0
0
3
@BanghuaZ
Banghua Zhu
3 days
Submissions due within one week!
@tf2m_workshop
Theoretical Foundations of Foundation Models
3 days
🚨 Submissions due on May 29! 🚨 Do you have exciting work on efficient & responsible foundation models or the principled foundations of large models? Submit your work now! We welcome submissions of work recently published or currently under review at other ML venues. @icmlconf
0
10
16
0
1
5
@BanghuaZ
Banghua Zhu
6 months
@besemer_amanda @ToastAPI Sure! Would love to see it being integrated & tested everywhere!
1
0
3
@BanghuaZ
Banghua Zhu
1 year
@xuanalogue @OpenAI @AnthropicAI @alexanderklew Might be that they want to prevent model distillation, since having log probs would make it much easier. But it's really hard for researchers to investigate the models without log probs.
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@DrCMcMaster Thank you! Yea actually RLHF may not improve the basic capabilities, but only the chat style. So I guess 7B will hallucinate much more than a larger model. If we can get llama3 70B from @AIatMeta or mistral 70B from @MistralAI that might be a very different story.
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@HammeIAm @jmorgan That might be our bad. We forgot to include some tokenizer file in the original file. I think this discussion might help. And we have already updated the files in the LM.
1
0
2
@BanghuaZ
Banghua Zhu
6 months
@abacaj That's amazing! We also have another powerful OSS model for function calling under internal testing (DM'ed). Do you want to have a try?
0
0
2
@BanghuaZ
Banghua Zhu
4 months
@gblazex Haha that's a good point. It is very interesting to see that a good combination of search + a GPT-3.5-level model can boost human preference greatly. I guess Perplexity's 70B model is also ~3.5 level, so it's really the difference in search integration?
1
0
2
@BanghuaZ
Banghua Zhu
6 months
@_lewtun Oh this is really nice and timely project! Thanks for the pointer @_lewtun !!
0
0
2
@BanghuaZ
Banghua Zhu
2 months
@Coolzippity Thx! We were mainly looking at one (unreleased) benchmark which correlates very well with human evaluation, on which our beta version is much better than alpha. I probably cannot give away more spoilers but I believe the benchmark will be out soon!
1
0
2
@BanghuaZ
Banghua Zhu
6 months
@Moi39017963 @jmorgan Thanks! This is very consistent with my personal impression as well! Working on a beta version to hopefully solve these!
1
0
1
@BanghuaZ
Banghua Zhu
3 months
@SOURADIPCHAKR18 @natolambert I think the current trl / trlx / openRLHF repo is very flexible in switching to smaller models. If you're looking for smaller reward models (<2B), I guess from @billyuchenlin might be a very good choice, although I might have missed a lot other good RMs
0
0
2
@BanghuaZ
Banghua Zhu
5 months
@_lewtun @Teknium1 @alignment_lab Haha yea my point was also that PPO was way harder to tune than DPO, considering the widespread open models tuned with DPO vs almost 0 models tuned from PPO…
2
0
1
@BanghuaZ
Banghua Zhu
1 year
Really great book which covers a wide range of topics in statistics and info theory!
@FrnkNlsn
Frank Nielsen
1 year
🎉 Very exciting new book on Information Theory coming out soon (600+ pages): "Information Theory: From Coding to Learning" by Y. Polyanskiy and Y. Wu (Cambridge University Press) 🆓Download book draft from
Tweet media one
7
217
1K
0
0
2
@BanghuaZ
Banghua Zhu
2 months
@natolambert Thx Nathan!! Excited to see your big news soon!
0
0
2
@BanghuaZ
Banghua Zhu
10 months
It's also amazing to note Mike's work on mixture-of-experts for neural networks dating back to '93, which is often overlooked, with credit today mostly going to Google in the LLM era.
0
0
2
@BanghuaZ
Banghua Zhu
3 months
@eating_entropy @xlr8harder Thanks!! A better model is dropping soon!
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@espadrine @rastadidi @jmorgan Yes that shall be the issue introduced by uncareful RLHF. Working on a beta version and hopefully can resolve some of these issues. Thx!
2
0
2
@BanghuaZ
Banghua Zhu
8 months
@natolambert Maybe also the DPO paper? We also have some work on reward modeling and policy learning. Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment Making PPO invariant to constant shift in reward, surpassing the performance of
0
0
2
@BanghuaZ
Banghua Zhu
1 year
@Francis_YAO_ Falcon is based on bf16. Any chance that fp16 degrades its performance?
1
0
2
@BanghuaZ
Banghua Zhu
6 months
@plantsci_guy Nice catch! Actually it seems to depend on the prompt dataset we used to train RLHF. For the current alpha model, the malicious prompt is only 12.5% of all prompts (and these are mostly straightforward prompts without jailbreaking), which makes the model more helpful but does not
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@rajammanabrolu The biggest challenge is that human eval for chat is too expensive. It takes 1-2 weeks on @lmsysorg to get less noisy feedback even for a single model checkpoint, and we need to select from 100 checkpoints. A better proxy metric for human eval is much more important than any
0
0
2
@BanghuaZ
Banghua Zhu
7 months
@rm_rafailov @natolambert @huggingface @lmsysorg Thx Rafael! Would love to see how these combined look like
1
0
1
@BanghuaZ
Banghua Zhu
7 months
@ylecun An even harder problem is a high-quality human feedback dataset. This requires that both the human preference is less noisy, and also that the responses are high quality. Open-collected data are usually more noisy than those collected from crowdsourcing platforms like Amazon.
0
0
2
@BanghuaZ
Banghua Zhu
6 months
Happy to see Google is catching up. It probably needs more human testing, but it seems very promising in terms of base capabilities! Really amazing that Google engineers almost reproduced it within a year.
@GoogleDeepMind
Google DeepMind
6 months
We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶: @Google ’s largest and most capable AI model. Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵
173
2K
6K
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@4evaBehindSOTA Absolutely! Eval for RLHF can be slightly harder than SFT, since almost all metrics in OpenLLM Leaderboard don't change too much, including MMLU, GSM8K etc. But @_lewtun is pointing to a new eval metric for instruction following. I completely agree that sparse human feedback +
1
0
2
@BanghuaZ
Banghua Zhu
8 months
Deep respect for the LLaMA team at @AIatMeta for their invaluable contributions to the OSS and research communities. Without them, there'd be no alpaca, vicuna, gorilla, or NexusRaven. While open sourcing can empower malicious actors, it equally empowers brilliant minds to
@martin_casado
martin_casado
8 months
Seriously, wtf is happening?
Tweet media one
67
18
290
0
0
1
@BanghuaZ
Banghua Zhu
1 year
@YiTayML Imitation attack
0
0
1
@BanghuaZ
Banghua Zhu
11 months
@ml_angelopoulos Congrats Anastasios!
0
0
1
@BanghuaZ
Banghua Zhu
8 months
0
0
1
@BanghuaZ
Banghua Zhu
1 year
@sytelus The main reason is that spectral ranking algorithms like PageRank are mostly for tabular / finite rankings. But what they want is to find the best responses from the continuous space of all possible tokens. Thus they use the idea of learning to rank to fit the reward model (see the sketch below).
0
0
1
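A sketch of the learning-to-rank objective being described above, using the generic Bradley-Terry-style pairwise loss for reward modeling (not any particular codebase): fit a scalar reward so that the preferred response scores higher than the rejected one.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: -log sigmoid(r_chosen - r_rejected),
    where r_chosen / r_rejected are scalar rewards for the preferred and
    dispreferred response to the same prompt."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```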
@BanghuaZ
Banghua Zhu
6 months
@srush_nlp Yea it's definitely not a math argument. The traditional way of (single round) RLHF is very special offline reward learning + online policy tuning. If the reward model is really good, then it's close to online RL. So another question here is whether we believe the reward model
0
0
1
@BanghuaZ
Banghua Zhu
10 months
@yisongyue @natolambert @nanjiang_cs @yoavartzi @Dilip_Arumugam Yea, this is an interesting setting of offline reward learning in contextual bandit + online policy fine-tuning in RL. The first reward learning stage is very close to dueling bandit, except that the target is not regret in online setting but suboptimality in offline setting
0
0
1
@BanghuaZ
Banghua Zhu
6 months
@cloudstudio_es Glad you like it! We're still working to improve the model. The model already has some known issues: 1. Occasionally output an extra ":" at the beginning. 2. Sometimes does not understand when to terminate and output unnecessary / weird content. 3. Hallucinates a lot. 1 & 2
0
0
1
@BanghuaZ
Banghua Zhu
12 days
@yubai01 @OpenAI Congrats Yu! Can’t wait to see what you build (and prove) there haha.
0
0
1
@BanghuaZ
Banghua Zhu
6 months
@OmarBessa Lol we also observe similar behavior in rare cases. Will fix things like this in the next version!
1
0
1
@BanghuaZ
Banghua Zhu
10 months
@natolambert That aleatoric uncertainty is captured by the BTL model (if we assume real human preference follows the BTL model), i.e. we know exactly that P(A>B) = sigmoid(r(A) - r(B)). What is harder is the model's uncertainty due to finite-sample approximation (epistemic uncertainty); see the note below.
1
0
1
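Written out (my notation): under the Bradley-Terry-Luce model the preference noise itself is fully determined by the reward gap, while the practically hard part is the uncertainty in the estimated reward.

```latex
% Aleatoric part: fixed by the (true) reward under BTL
P(A \succ B) = \sigma\big(r(A) - r(B)\big)
% Epistemic part: finite-sample error in \hat{r}, i.e. uncertainty about
% \hat{r}(A) - \hat{r}(B) rather than about the preference coin flip itself.
```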
@BanghuaZ
Banghua Zhu
6 months
@rastadidi @jmorgan My apologies if this leads to some misunderstanding. MT Bench is mostly for evaluating the "helpfulness" of the output with GPT4 preference. So I would say it's better in the style it answers questions related to reasoning, but not necessarily really improving the basic
0
0
1
@BanghuaZ
Banghua Zhu
5 months
@omarsar0 Would love to see your test result on this!
@NexusflowX
Nexusflow
6 months
🚀Calling all developers of copilots and AI agents! Introducing 🐦‍⬛NexusRaven V2, a 13B function calling LLM surpassing GPT-4 in real-world zero-shot tool use. ✨ Highlights of 🐦‍⬛NexusRaven V2: 💪Superior Performance: NexusRaven V2 surpasses GPT-4 up to 7% on complex nested and
Tweet media one
13
98
395
0
0
1
@BanghuaZ
Banghua Zhu
4 months
1
0
1
@BanghuaZ
Banghua Zhu
6 months
@rm_rafailov Sounds like you need to first score 100 on GAIA before even thinking about this lol.
0
0
0