Banghua Zhu

@BanghuaZ

1,679 Followers · 820 Following · 11 Media · 220 Statuses

PhD @Berkeley_EECS , statistics, info theory, LLM, RL, Human-AI Interactions.

Berkeley, CA
Joined August 2018
Pinned Tweet
@BanghuaZ
Banghua Zhu
2 months
🚀 Presenting Starling-LM-7B-beta, our cutting-edge 7B language model fine-tuned with RLHF! 🌟 Also introducing Starling-RM-34B, a Yi-34B-based reward model trained on our Nectar dataset, surpassing our previous 7B RM in all benchmarks. ✨ We've fine-tuned the latest Openchat
13
52
282
@BanghuaZ
Banghua Zhu
6 months
🚀Introducing new (synthetic) RLHF Dataset Nectar and new open model Starling-LM-7B-alpha🚀 🌟 Model & Dataset Highlights: 📊 Scores 8.09 in MT Bench: Surpassing all existing models except OpenAI's GPT-4 and GPT-4 Turbo. 📚 183K Chat Prompts + 7 responses in Nectar: With 3.8M
Tweet media one
20
135
697
@BanghuaZ
Banghua Zhu
6 months
Excited to see starling-7B-alpha is (slightly) more preferred than other 7B models! Actually I expected the other way around. Attaching my favorite example below. Starling-alpha is for sure slightly over-RLHFed to maximize GPT-4 preference rather than human preference and can be
Tweet media one
@lmsysorg
lmsys.org
6 months
Exciting Arena Leaderboard Updates! Six new models: - Tulu-2-DPO-70B and Yi-34B-Chat are the new SoTA open models - Mistral-based 7B models (OpenChat, OpenHermes-2.5, Starling-7B) are stronger than ever Big congrats to the OSS AI community! Learn more
Tweet media one
Tweet media two
12
75
344
5
69
76
@BanghuaZ
Banghua Zhu
11 months
Fine-Tuning LMs with Advantage-Induced Policy Alignment We propose a new RL algorithm, APA, that improves over PPO with better KL control and performance, and benchmark PPO, AWR, APA in offline and online RLHF. arxiv: HF page:
Tweet media one
5
18
78
@BanghuaZ
Banghua Zhu
6 months
I'll be at #NeurIPS2023 , and the academic job market this year! RT will be greatly appreciated! I work on statistics and information theory, with applications in robust statistics, offline RL, game theory, human-AI interactions and LLMs. I'm recently working on better
Tweet media one
Tweet media two
0
19
68
@BanghuaZ
Banghua Zhu
4 months
Not sure if that's a fair comparison when Bard is using a search API while GPT-4 and other models are not (example below). The bare-metal Gemini Pro API seems to sit between Mixtral 8x7B and GPT-3.5. So is the key difference search, which greatly improves human preference?
Tweet media one
@lmsysorg
lmsys.org
4 months
🔥Breaking News from Arena Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement! The race is heating up like never before! Super excited to see what's next for Bard + Gemini
Tweet media one
155
630
3K
6
10
60
@BanghuaZ
Banghua Zhu
5 months
This seems to be very related to conditional SFT, where you feed in data of mixed quality with different "hidden signals" (like adding a GPT-4 or GPT-3.5 tag in the chat template) during SFT, and at inference time you only use the chat template of the highest quality. It's
@abacaj
anton
5 months
Telling mixtral that it is "ChatGPT developed by OpenAI" boosts humaneval score by 6%
Tweet media one
Tweet media two
162
278
4K
1
9
54
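A minimal sketch of the conditional-SFT idea described in the post above: tag each training example with a "hidden signal" indicating the source/quality of the response, then always condition on the highest-quality tag at inference time. The tag strings and template below are illustrative assumptions, not a specific recipe.

```python
# Illustrative conditional SFT with quality tags ("hidden signals").
# Tag names and template format are assumptions for this sketch only.
QUALITY_TAGS = {"gpt-4": "<|quality:high|>", "gpt-3.5": "<|quality:mid|>"}

def format_sft_example(prompt: str, response: str, source_model: str) -> str:
    """Training time: prepend a tag recording which model wrote the response."""
    return f"{QUALITY_TAGS[source_model]} User: {prompt}\nAssistant: {response}"

def format_inference_prompt(prompt: str) -> str:
    """Inference time: always condition on the highest-quality tag."""
    return f"{QUALITY_TAGS['gpt-4']} User: {prompt}\nAssistant:"
```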
@BanghuaZ
Banghua Zhu
6 months
Thrilled to share one of the most exciting projects I've been involved in this year. Imagine a 13B model that may run locally on your device, using tools and calling nested / parallel functions more effectively than GPT-4. What excites me most isn't just the model itself, but
@NexusflowX
Nexusflow
6 months
🚀Calling all developers of copilots and AI agents! Introducing 🐦‍⬛NexusRaven V2, a 13B function calling LLM surpassing GPT-4 in real-world zero-shot tool use. ✨ Highlights of 🐦‍⬛NexusRaven V2: 💪Superior Performance: NexusRaven V2 surpasses GPT-4 up to 7% on complex nested and
Tweet media one
13
98
395
1
4
50
@BanghuaZ
Banghua Zhu
2 months
Very interesting and detailed analysis of Starling benchmark results. Usually RLHF won't change the model capability too much. But the style of the responses will look more helpful and less harmful; that's probably why it ranks higher on human evaluation. The capability of the
@maximelabonne
Maxime Labonne
2 months
🔍 What Starling-LM-7B-beta's excellent performance tells us about benchmarks I compared the performance of @NexusflowX 's model across various benchmarks. In the Chatbot Arena Leaderboard (), this 7B model impressively outperforms many larger models,
Tweet media one
8
26
187
3
4
43
@BanghuaZ
Banghua Zhu
10 months
Sharing our RLHF work! #ICML2023 We analyze reward learning in RLHF: 1. There's an asymptotically more efficient K-wise alternative to the original algo in the InstructGPT paper (ChatGPT) 2. MLE converges for parameter estimation, but requires pessimism to converge for policy learning (a sketch of the K-wise objective is below)
Tweet media one
5
5
36
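For reference, a sketch of the K-wise reward-learning objective the post refers to, assuming a Plackett-Luce model over a ranking y_{σ(1)} ≻ … ≻ y_{σ(K)} of K responses to a prompt x (the notation here is mine, not necessarily the paper's):

```latex
% K-wise MLE under a Plackett-Luce model (illustrative notation)
\hat{\theta} = \arg\max_{\theta} \sum_{(x,\sigma)} \sum_{i=1}^{K-1}
  \log \frac{\exp\!\big(r_\theta(x, y_{\sigma(i)})\big)}
            {\sum_{j=i}^{K} \exp\!\big(r_\theta(x, y_{\sigma(j)})\big)}
```

For K = 2 this reduces to the pairwise Bradley-Terry MLE used in InstructGPT.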
@BanghuaZ
Banghua Zhu
1 month
Very excited about the release of Arena-Hard, the main benchmark we looked at when selecting checkpoints for the Starling model. It focuses on a subset of very hard prompts from Chatbot Arena.
@lmsysorg
lmsys.org
1 month
Introducing Arena-Hard – a pipeline to build our next generation benchmarks with live Arena data. Highlights: - Significantly better separability than MT-bench (22.6% -> 87.4%) - Highest agreement to Chatbot Arena ranking (89.1%) - Fast & cheap to run ($25) - Frequent update
Tweet media one
20
124
643
1
3
34
@BanghuaZ
Banghua Zhu
1 year
(1/n) LLM inference can be costly due to the large model size and auto-regressive nature. In practice, can we find the best way to cache existing queries and choose the most appropriate model for inference? In , we initiate a study towards that.
Tweet media one
1
5
26
@BanghuaZ
Banghua Zhu
3 months
Very interesting paper that studies the effect of RLHF with Starling reward model! Excited to see more open research in this space.
@michaelryan207
Michael Ryan
3 months
Aligned LLMs should be helpful, harmless, and adopt user preferences. But whose preferences are we aligning to and what are unintended effects on global representation? We find SFT and Preference Tuning steer LLMs towards US English use and opinions. 🧵
Tweet media one
5
53
209
1
6
25
@BanghuaZ
Banghua Zhu
1 month
Check out the ICML workshop on Theoretical Foundations of Foundation Models!
@tf2m_workshop
Theoretical Foundations of Foundation Models
1 month
We are happy to announce that the Workshop on Theoretical Foundations of Foundation Models will take place @icmlconf in Vienna! For details: Organizers: @BerivanISIK , @SZiteng , @BanghuaZ , @eaboix , @nmervegurel , @uiuc_aisecure , @abeirami , @sanmikoyejo
1
11
49
0
3
23
@BanghuaZ
Banghua Zhu
2 months
DBRX is an amazing masterpiece! If you're looking for smaller models for your use cases, plz give Starling-7B a try, which seems not too bad according to chatbot arena!
@NexusflowX
Nexusflow
2 months
Have we really squeezed out the capacity of a compact chat model? Thrilled to see our latest open model, Starling-7B, ranks 13th among all models in Chatbot Arena! 🚀 As a 7B model, Starling surpasses larger open and proprietary models, including Claude-2, GPT-3.5-Turbo, Gemini
Tweet media one
4
20
105
1
3
21
@BanghuaZ
Banghua Zhu
6 months
Thank you so much @ClementDelangue !! More exciting news coming soon!
@ClementDelangue
clem 🤗
6 months
So cool to see the #1 trending dataset released by academia ( @UCBerkeley ) & #2 by a non-profit ( @wikimedia ). IMO academia & non-profits have the opportunity in the US to fill the void left by big tech companies on open science and open-source AI.
Tweet media one
2
26
178
1
2
21
@BanghuaZ
Banghua Zhu
6 months
@srush_nlp Personally I don't believe the reason for PPO is complicated math / regularization. I think the key difference is still offline RL vs online RL. In theory, offline RL can bring you to the best covered policy. If the response in preference contains high quality GPT data, offline
6
1
20
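To unpack "best covered policy": one standard (hedged) way to state the offline-RL guarantee is through a coverage / concentrability coefficient, i.e. offline RL can only be expected to compete with comparator policies π whose visitation distribution is covered by the data distribution μ:

```latex
% Coverage condition for a comparator policy \pi w.r.t. the data distribution \mu
C_\pi \;=\; \sup_{s,a} \frac{d^\pi(s,a)}{\mu(s,a)} \;<\; \infty
```

So if the preference data already contains high-quality (e.g. GPT-4) responses, the set of covered policies includes very strong ones.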
@BanghuaZ
Banghua Zhu
2 months
Huge congrats to the amazing folks at lmsys! Vicuna and chatbot arena are really important milestones in the field of open source and LLMs!
@lmsysorg
lmsys.org
2 months
One year ago was Vicuna's birthday🎂! We were so excited and built a demo for it at chat.lmsys.org. We never imagined it could get this far. Millions of people downloaded our models, visited our demo, and played with our fine-tuning recipe in FastChat project. We then
7
21
197
0
1
17
@BanghuaZ
Banghua Zhu
3 months
Check out our recent position paper on GenAI security, a very interesting field to work on and a lot of open problems there!
@TheNormanMu
Norman Mu
3 months
Securing models against adversarial manipulation is table stakes today for real-world GenAI/LLM deployments. In our new position paper with @BanghuaZ , @JiantaoJ , and David Wagner we outline current challenges and promising directions for future work in GenAI security
Tweet media one
1
8
51
0
6
18
@BanghuaZ
Banghua Zhu
6 months
@_albertgu @tri_dao Wow this is amazing! Curious for 7B comparisons, why would you compare mostly with GPT-J, Pythia rather than newer llama 7B or mistral 7B? Is it because of different tokenization? I saw the scaling law in Figure 4 which seems really promising compared with llama family, but
2
0
17
@BanghuaZ
Banghua Zhu
2 months
It’s very interesting to see DPO models being used as a natural reward model! And excited to see Starling-RM-34B on the top of RewardBench!
@natolambert
Nathan Lambert
2 months
Excited to share something that we've needed since the early open RLHF days: RewardBench, the first benchmark for reward models. 1. We evaluated 30+ of the currently available RMs (w/ DPO too). 2. We created new datasets covering chat, safety, code, math, etc. We learned a lot.
Tweet media one
Tweet media two
Tweet media three
113
188
502
0
3
16
@BanghuaZ
Banghua Zhu
6 months
Forgot to add this, but huge kudos to the whole team: Evan Frick, @WthThao (two co-first authors), @zhuhl98 and Jiantao Jiao. Also huge thanks to the open source communities for their great work: @lmsysorg , @huggingface , @AIatMeta , @MistralAI , @alignment_lab , @AnthropicAI ,
0
0
14
@BanghuaZ
Banghua Zhu
7 months
Excited to introduce Pairwise PPO (P3O), a bit of surgery on PPO that makes it invariant to constant shifts in reward and outperforms both PPO and (an online version of) DPO in terms of the KL-reward tradeoff in RLHF. The key intuition is that the reward model in RLHF is trained to be invariant
@WthThao
TianhaoWu
7 months
🤨 Why not use *comparative* RL to fine-tune your LLMs? 💥 We propose *Pairwise* Proximal Policy Optimization, which performs RL in a comparative manner, surpassing PPO and DPO in LLM alignment Blog: arxiv:
Tweet media one
5
13
81
0
0
13
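A one-line illustration (in my notation) of the shift-invariance point from the post above: if the learned reward is only identified up to a prompt-dependent constant, pairwise differences cancel that constant, so a comparative update is unaffected by it.

```latex
% If r'(x, y) = r(x, y) + c(x) for an arbitrary prompt-dependent shift c(x), then
r'(x, y_1) - r'(x, y_2) \;=\; r(x, y_1) - r(x, y_2)
% so any update built from reward differences between two responses to the
% same prompt is invariant to the shift.
```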
@BanghuaZ
Banghua Zhu
1 month
Chatbot Arena usually captures the combination of two aspects: basic capability + human preference alignment. In terms of basic capability, it seems still not yet at GPT-4 level on all benchmark metrics. But Llama3 did a really great job on human preference alignment, likely
@lmsysorg
lmsys.org
1 month
Exciting update -- Llama-3 full result is out, now reaching top-5 on the Arena leaderboard🔥 We've got stable enough CIs with over 12K votes. No question now Llama-3 70B is the new king of open model. Its powerful 8B variant has also surpassed many larger-size models. What an
Tweet media one
30
166
1K
0
1
13
@BanghuaZ
Banghua Zhu
3 months
Thanks a lot Nathan! We are also on the way to tune a larger RM using the same recipe, and see how far it can bring us for tuning downstream LMs. Hope the understanding of RLHF can be much deeper with all these open source efforts!
@natolambert
Nathan Lambert
3 months
While I'm here there are two reward models that are better than the others available in my testing (which hasn't included many DPO models yet). 1. Starling 7B (from Llama 2 chat). Similar to UltraRM on chat performance and is better on safety (likes refusals to toxic prompts).
2
9
60
0
2
13
@BanghuaZ
Banghua Zhu
2 months
Excited to see the new startup from @yisongyue dedicated to AI agents with strong planning and reasoning capabilities! Can't wait to see what will be built by the amazing folks there!
@yisongyue
Yisong Yue
2 months
I'm thrilled to be a part of @AsariAILabs . Our goal is to design AI systems that can break down problems, discover new abstractions, reason about their correctness (and what notions of correctness are required), and generally plan at multiple levels of granularity. These
15
15
176
0
1
12
@BanghuaZ
Banghua Zhu
6 months
Got some really insightful questions from @Fluke_Ellington (and also post from @ldjconfirmed ) on potential training data contamination with MT-Bench. I believe a more detailed explanation here would be beneficial: 1. There's a possibility that the
@ldjconfirmed
LDJ
6 months
Important contamination warning for those using Pure-Dove or derivative datasets & models! I personally don't use AI-judged benchmarks like MT-bench, so I don't typically check my datasets for contamination of such. But thanks to @Fluke_Ellington at @MistralAI , we've
Tweet media one
3
10
69
0
0
10
@BanghuaZ
Banghua Zhu
2 months
@rasbt Yes, sorry we delayed that a bit since we are refactoring the code. But hopefully the code and paper will be out soon!
1
0
10
@BanghuaZ
Banghua Zhu
5 months
This is very interesting. I thought Alpaca Eval might be better correlated because it has a larger prompt test set. But from this result it seems that MT Bench is still a better proxy. Shall we change the reference output of Alpaca Eval from text-davinci-003 to gpt-3.5 or 4?
@gblazex
Blaze (Balázs Galambosi)
5 months
Agree. I did a quick correlation check to the Elo ratings and MT-bench seems to be the closest to human evaluation:
MT bench: 0.97
MMLU: 0.88
AGI eval: 0.87
HELM Lite: 0.85
Alpaca: 0.74
Hugging leaderboard: 0.71
OpenCompass (en): 0.56
7
27
236
1
1
6
@BanghuaZ
Banghua Zhu
6 months
Very neat idea that replaces ReAct-style function calls with DAG-style function calls. It would be interesting to see how well GPT-4 works as an LLM planner in more complex scenarios, and whether we can get a better LLM planner with open source models. Maybe the GAIA leaderboard is a good
@sehoonkim418
sehoonkim
6 months
How can we make LLM agents work together efficiently on complex tasks at a large scale? 🚨Introducing LLMCompiler🦙🛠️, a tool that compiles an effective plan for executing multiple tasks in parallel. It helps create scalable LLM applications, identifies tasks for parallel
Tweet media one
17
126
781
0
1
8
@BanghuaZ
Banghua Zhu
6 months
@winglian I think so! We also provide the example code for using reward model here: . This shall require minimal modification to the trl repo. Sorry for being a bit messy here, things will be much clearer once we release the full code.
1
0
7
@BanghuaZ
Banghua Zhu
6 months
@hu_yifei Yes, we also mentioned in the blog that GPT-4 might prefer longer and talky answers, so we observe in Alpaca Eval that the average response length is 1624 with temperature 0 (the temperature that gets the best alpaca eval score), in contrast, llama2 70B chat is 1790, gpt-4 turbo
0
0
7
@BanghuaZ
Banghua Zhu
5 months
@argilla_io @ClementDelangue Would be great if we have a large scale, high quality human preference dataset with responses generated by Mixtral, gpt, Claude and some other open models.
1
0
5
@BanghuaZ
Banghua Zhu
9 months
Very cool work on OSS MoE! Can't wait to see how good the final checkpoint is after 1T tokens.
@XueFz
Fuzhao Xue
9 months
1/ Announcing the development of OpenMoE project! 🚀 Open Mixture-of-Experts Language Models! MoE + UL2 objective + umT5 tokenizer + 50% code data mix. GitHub: Blog:
10
107
532
0
1
6
@BanghuaZ
Banghua Zhu
6 months
@morgymcg Thank you! Yes, it's definitely on our to-do list. Currently we are organizing the codebase and finishing up the paper, will include wandb metrics soon as well!
0
0
6
@BanghuaZ
Banghua Zhu
6 months
@rm_rafailov @natolambert @_lewtun @lvwerra @Teknium1 @teortaxesTex @abacaj @norabelrose @srush_nlp @stanfordnlp @peterjliu Yea I like the idea of running large-scale "Gold RM" experiments and comparing to a PPO-trained policy using the implicit DPO reward. Honestly I find it really hard to compare DPO vs reward + PPO in an absolutely fair fashion since it might really depend on 1. The relative quality of
0
0
4
@BanghuaZ
Banghua Zhu
5 months
@hausman_k Congrats Karol!
0
0
1
@BanghuaZ
Banghua Zhu
4 months
Very interesting and timely topic given how important it is to collect high-quality human preference datasets!
@lilianweng
Lilian Weng
4 months
🗣️I've been thinking about data quality & human factor in the process a lot lately, so write a short post on the topic: More: If you are into the topic, my team is hiring Research Engineer for a new sub-team Human-AI Interaction:
24
103
760
0
0
5
@BanghuaZ
Banghua Zhu
6 months
@srush_nlp We just found a new battlefield in NLP after long fights in game / robotics lol. But always fun to debate more and get more exps there.
1
0
4
@BanghuaZ
Banghua Zhu
1 year
@zdhnarsil The RL part is actually tricky here. From the InstructGPT paper, their prompts are just sampled from a huge dataset so no exploration. Also it's simply a contextual bandit rather than MDP, so no transition. That said, you can use any method that trains a nn to fine tune the model
0
0
4
@BanghuaZ
Banghua Zhu
24 days
Paper submission & reviewer volunteer form are open!
@tf2m_workshop
Theoretical Foundations of Foundation Models
25 days
We welcome submissions that make theoretical contributions to efficiency, responsibility, and principled understanding of foundation models. For more details, check out our call for papers: Deadline: May 22nd. 1/2
1
5
14
0
2
4
@BanghuaZ
Banghua Zhu
6 months
@bensparks_ Yes, the reward model training takes most of the time, which is around 2-3 days on 8 A100 80G GPUs. After we get the reward model, the online RL finetuning for the LLM takes about several hours on 8 A100 80G GPUs (we only unfreeze the last 4 layers for language model tuning so
0
0
4
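A rough sketch of what "only unfreeze the last 4 layers" could look like, assuming a Hugging Face Llama/Mistral-style causal LM whose decoder blocks live in `model.model.layers` (an illustration, not the released training code; the base-model name is just an example):

```python
# Illustrative: freeze everything, then unfreeze only the last 4 decoder blocks
# before RL fine-tuning. Assumes an HF Llama/Mistral-style architecture.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("openchat/openchat_3.5")  # example base model

for param in model.parameters():
    param.requires_grad = False

for block in model.model.layers[-4:]:
    for param in block.parameters():
        param.requires_grad = True
```

Restricting the trainable parameters this way is what helps keep the RL stage much cheaper than full fine-tuning.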
@BanghuaZ
Banghua Zhu
6 months
@_lewtun @StenRuediger @nlpguy_ @GoogleDeepMind Completely agree! It's interesting to note that Openchat 3.5 seems to skip SFT, directly doing C-RLFT (an offline method) on pre-trained models. So I'm curious whether DPO also has the potential to directly replace SFT as well? The ultimate pipeline I have in mind is DPO /
2
0
4
@BanghuaZ
Banghua Zhu
5 months
@thegautamkamath Seems they still want to generate, just at temperature=0
0
0
4
@BanghuaZ
Banghua Zhu
10 months
Just watched an insightful interview with JT & my advisor Mike Jordan on the AI hype. He's excited to see how simple ML methods can achieve great results, but also warns against hyping exaggerated information that could lead to inappropriate regulation.
1
0
4
@BanghuaZ
Banghua Zhu
6 months
@rajammanabrolu We're also experimenting with these. It seems that online RL really shines when you push for last mile performance, while DPO is yet to be validated (on our to-do list), especially when you fine tune on some model that is already very good. Will have more to say about this
0
0
3
@BanghuaZ
Banghua Zhu
5 months
@eating_entropy @airkatakana @Teknium1 @alignment_lab Depends on how you define performance. Usually SFT and DPO shall improve the score on both the OpenLLM leaderboard (capability) and MT Bench / human preference (helpfulness), while RLHF won't affect capability but only improve human preference. If your RLHF dataset doesn't
1
1
3
@BanghuaZ
Banghua Zhu
6 months
@morgymcg @jxnlco Similar to my experience. Turbo is very optimized for the chat experience, beating GPT-4 on Chatbot Arena by @lmsysorg . But other capabilities like instruction following and function calling seem slightly worse. I suspect they got to distill a ~100B non-MoE model and do very good
0
0
3
@BanghuaZ
Banghua Zhu
10 months
1
0
3
@BanghuaZ
Banghua Zhu
10 months
ArXiv Link: Besides the main conf, I'll also present it in the MFPL workshop! Also check out the other two papers at ICML on Human-AI interaction in Stackelberg games () and Jump-Start RL (). Happy to chat more in Hawaii!
Tweet media one
Tweet media two
0
0
3
@BanghuaZ
Banghua Zhu
6 months
@LMStudioAI @HenkPoley Oh thanks for pointing this out! Seems that we didn't include the tokenizer files from openchat 3.5. Now things shall be fixed.
1
0
3
@BanghuaZ
Banghua Zhu
10 months
This is very interesting. Human-in-the-loop might be the right answer for copilot in the end. It also suggests that prompting is not enough for complicated API usage, a thorough fine-tuning (SFT or RL) shall benefit a lot.
@AlexKontorovich
Alex Kontorovich
10 months
Interesting GPT4 experiments by Ernest Davis and Scott Aaronson on arxiv today: This has been my experience as well (p. 15): "It seems likely that GPT4+CI and GPT4+WA are most useful when not relied on as “oracles,” but in “interactive mode”..."
3
30
143
0
0
3
@BanghuaZ
Banghua Zhu
6 months
@OpenAI This year is crazy... What would be the surprising news in the last one month..
0
0
3
@BanghuaZ
Banghua Zhu
3 days
Submissions due within one week!
@tf2m_workshop
Theoretical Foundations of Foundation Models
3 days
🚨 Submissions due on May 29! 🚨 Do you have exciting work on efficient & responsible foundation models or the principled foundations of large models? Submit your work now! We welcome submissions of work recently published or currently under review at other ML venues. @icmlconf
0
10
16
0
1
5
@BanghuaZ
Banghua Zhu
6 months
@besemer_amanda @ToastAPI Sure! Would love to see it being integrated & tested everywhere!
1
0
3
@BanghuaZ
Banghua Zhu
1 year
@xuanalogue @OpenAI @AnthropicAI @alexanderklew Might be that they want to prevent model distillation, since having log probs would make it much easier. But it's really hard for researchers to investigate the models without log probs.
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@DrCMcMaster Thank you! Yea actually RLHF may not improve the basic capabilities, but only the chat style. So I guess 7B will hallucinate much more than a larger model. If we can get llama3 70B from @AIatMeta or mistral 70B from @MistralAI that might be a very different story.
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@HammeIAm @jmorgan That might be our bad. We forgot to include some tokenizer file in the original file. I think this discussion might help. And we have already updated the files in the LM.
1
0
2
@BanghuaZ
Banghua Zhu
6 months
@abacaj That's amazing! We also have another powerful OSS model for function calling under internal testing (DM'ed). Do you want to have a try?
0
0
2
@BanghuaZ
Banghua Zhu
4 months
@gblazex Haha that's a good point. It is very interesting to see that a good combination of search + a GPT-3.5-level model can boost human preference greatly. I guess Perplexity's 70B model is also ~3.5 level, so it's really the difference in search integration?
1
0
2
@BanghuaZ
Banghua Zhu
6 months
@_lewtun Oh this is really nice and timely project! Thanks for the pointer @_lewtun !!
0
0
2
@BanghuaZ
Banghua Zhu
2 months
@Coolzippity Thx! We were mainly looking at one (unreleased) benchmark which correlates very well with human evaluation, on which our beta version is much better than alpha. I probably cannot give away more spoilers but I believe the benchmark will be out soon!
1
0
2
@BanghuaZ
Banghua Zhu
6 months
@Moi39017963 @jmorgan Thanks! This is very consistent with my personal impression as well! Working on a beta version to hopefully solve these!
1
0
1
@BanghuaZ
Banghua Zhu
3 months
@SOURADIPCHAKR18 @natolambert I think the current trl / trlx / openRLHF repo is very flexible in switching to smaller models. If you're looking for smaller reward models (<2B), I guess from @billyuchenlin might be a very good choice, although I might have missed a lot other good RMs
0
0
2
@BanghuaZ
Banghua Zhu
5 months
@_lewtun @Teknium1 @alignment_lab Haha yea my point was also that PPO was way harder to tune than DPO, considering the widespread open models tuned with DPO vs almost 0 models tuned from PPO…
2
0
1
@BanghuaZ
Banghua Zhu
1 year
Really great book which covers a wide range of topics in statistics and info theory!
@FrnkNlsn
Frank Nielsen
1 year
🎉 Very exciting new book on Information Theory coming out soon (600+ pages): "Information Theory: From Coding to Learning" by Y. Polyanskiy and Y. Wu (Cambridge University Press) 🆓Download book draft from
Tweet media one
7
217
1K
0
0
2
@BanghuaZ
Banghua Zhu
2 months
@natolambert Thx Nathan!! Excited to see your big news soon!
0
0
2
@BanghuaZ
Banghua Zhu
10 months
It's also amazing to note Mike's work on mixture-of-experts for neural networks dating back to '93, which is often overlooked, with credit today mostly going to Google in the LLM era.
0
0
2
@BanghuaZ
Banghua Zhu
3 months
@eating_entropy @xlr8harder Thanks!! A better model is dropping soon!
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@espadrine @rastadidi @jmorgan Yes that shall be the issue introduced by uncareful RLHF. Working on a beta version and hopefully can resolve some of these issues. Thx!
2
0
2
@BanghuaZ
Banghua Zhu
8 months
@natolambert Maybe also the DPO paper? We also have some work on reward modeling and policy learning. Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment Making PPO invariant to constant shift in reward, surpassing the performance of
0
0
2
@BanghuaZ
Banghua Zhu
1 year
@Francis_YAO_ Falcon is based on bf16. Any chance that fp16 degrades its performance?
1
0
2
@BanghuaZ
Banghua Zhu
6 months
@plantsci_guy Nice catch! Actually it seems to depend on the prompt dataset we used to train RLHF. For the current alpha model, the malicious prompt is only 12.5% of all prompts (and these are mostly straightforward prompts without jailbreaking), which makes the model more helpful but does not
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@rajammanabrolu The biggest challenge is that human eval for chat is too expensive. It takes 1-2 weeks on @lmsysorg to get less noisy feedback even for a single model checkpoint, and we need to select from 100 checkpoints. A better proxy metric for human eval is much more important than any
0
0
2
@BanghuaZ
Banghua Zhu
7 months
@rm_rafailov @natolambert @huggingface @lmsysorg Thx Rafael! Would love to see how these combined look like
1
0
1
@BanghuaZ
Banghua Zhu
7 months
@ylecun An even harder problem is a high-quality human feedback dataset. This requires that both the human preference is less noisy, and also that the responses are high quality. Open-collected data are usually more noisy than those collected from crowdsourcing platforms like Amazon.
0
0
2
@BanghuaZ
Banghua Zhu
6 months
Happy to see Google is catching up. It probably needs more human testing, but it seems very promising in terms of base capabilities! Really amazing that Google engineers almost reproduced it within a year.
@GoogleDeepMind
Google DeepMind
6 months
We’re excited to announce 𝗚𝗲𝗺𝗶𝗻𝗶: @Google ’s largest and most capable AI model. Built to be natively multimodal, it can understand and operate across text, code, audio, image and video - and achieves state-of-the-art performance across many tasks. 🧵
173
2K
6K
0
0
2
@BanghuaZ
Banghua Zhu
6 months
@4evaBehindSOTA Absolutely! Eval for RLHF can be slightly harder than SFT, since almost all metrics in OpenLLM Leaderboard don't change too much, including MMLU, GSM8K etc. But @_lewtun is pointing to a new eval metric for instruction following. I completely agree that sparse human feedback +
1
0
2
@BanghuaZ
Banghua Zhu
8 months
Deep respect for the LLaMA team at @AIatMeta for their invaluable contributions to the OSS and research communities. Without them, there'd be no alpaca, vicuna, gorilla, or NexusRaven. While open sourcing can empower malicious actors, it equally empowers brilliant minds to
@martin_casado
martin_casado
8 months
Seriously, wtf is happening?
Tweet media one
67
18
290
0
0
1
@BanghuaZ
Banghua Zhu
1 year
@YiTayML Imitation attack
0
0
1
@BanghuaZ
Banghua Zhu
11 months
@ml_angelopoulos Congrats Anastasios!
0
0
1
@BanghuaZ
Banghua Zhu
8 months
0
0
1
@BanghuaZ
Banghua Zhu
1 year
@sytelus The main reason is that spectral ranking algorithms like PageRank are mostly for tabular / finite rankings. But what they want is to find the best responses from the continuous space of all possible tokens. Thus they use the idea of learning to rank to fit the reward model (see the sketch below).
0
0
1
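A sketch of the learning-to-rank objective being described above, using the generic Bradley-Terry-style pairwise loss for reward modeling (not any particular codebase): fit a scalar reward so that the preferred response scores higher than the rejected one.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: -log sigmoid(r_chosen - r_rejected),
    where r_chosen / r_rejected are scalar rewards for the preferred and
    dispreferred response to the same prompt."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```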
@BanghuaZ
Banghua Zhu
6 months
@srush_nlp Yea it's definitely not a math argument. The traditional way of (single round) RLHF is very special offline reward learning + online policy tuning. If the reward model is really good, then it's close to online RL. So another question here is whether we believe the reward model
0
0
1
@BanghuaZ
Banghua Zhu
10 months
@yisongyue @natolambert @nanjiang_cs @yoavartzi @Dilip_Arumugam Yea, this is an interesting setting of offline reward learning in contextual bandit + online policy fine-tuning in RL. The first reward learning stage is very close to dueling bandit, except that the target is not regret in online setting but suboptimality in offline setting
0
0
1
@BanghuaZ
Banghua Zhu
6 months
@cloudstudio_es Glad you like it! We're still working to improve the model. The model already has some known issues: 1. Occasionally output an extra ":" at the beginning. 2. Sometimes does not understand when to terminate and output unnecessary / weird content. 3. Hallucinates a lot. 1 & 2
0
0
1
@BanghuaZ
Banghua Zhu
12 days
@yubai01 @OpenAI Congrats Yu! Can’t wait to see what you build (and prove) there haha.
0
0
1
@BanghuaZ
Banghua Zhu
6 months
@OmarBessa Lol we also observe similar behavior in rare cases. Will fix things like this in the next version!
1
0
1
@BanghuaZ
Banghua Zhu
10 months
@natolambert That aleatoric uncertainty is captured by the BTL model (if we assume real human preference follows the BTL model), i.e. we know exactly that P(A>B) = sigmoid(r(A) - r(B)). What is harder is the model's uncertainty due to finite-sample approximation (epistemic uncertainty); see the note below.
1
0
1
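Written out (my notation): under the Bradley-Terry-Luce model the preference noise itself is fully determined by the reward gap, while the practically hard part is the uncertainty in the estimated reward.

```latex
% Aleatoric part: fixed by the (true) reward under BTL
P(A \succ B) = \sigma\big(r(A) - r(B)\big)
% Epistemic part: finite-sample error in \hat{r}, i.e. uncertainty about
% \hat{r}(A) - \hat{r}(B) rather than about the preference coin flip itself.
```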
@BanghuaZ
Banghua Zhu
6 months
@rastadidi @jmorgan My apologies if this leads to some misunderstanding. MT Bench is mostly for evaluating the "helpfulness" of the output with GPT4 preference. So I would say it's better in the style it answers questions related to reasoning, but not necessarily really improving the basic
0
0
1
@BanghuaZ
Banghua Zhu
5 months
@omarsar0 Would love to see your test result on this!
@NexusflowX
Nexusflow
6 months
🚀Calling all developers of copilots and AI agents! Introducing 🐦‍⬛NexusRaven V2, a 13B function calling LLM surpassing GPT-4 in real-world zero-shot tool use. ✨ Highlights of 🐦‍⬛NexusRaven V2: 💪Superior Performance: NexusRaven V2 surpasses GPT-4 up to 7% on complex nested and
Tweet media one
13
98
395
0
0
1
@BanghuaZ
Banghua Zhu
4 months
1
0
1
@BanghuaZ
Banghua Zhu
6 months
@rm_rafailov Sounds like you need to first score 100 on GAIA before even thinking about this lol.
0
0
0