Excited to share our work at
@GoogleDeepMind
!
We propose Naturalized Execution Tuning (NExT), a self-training method that drastically improves LLMs' ability to reason about code execution by learning to inspect execution traces and generate chain-of-thought rationales 🧵👇
I recently gave a guest lecture (outline below) about LLMs for code and math for the "AI Foundation Models" course at Yale, and I've just made the slides and recordings publicly available:
slides:
recordings:
Execution results are strong indicators of program correctness. But how can we improve LLMs for code generation with execution?
In our new paper, we propose LEVER, a simple method that learns to verify and rerank LLM-generated programs with their execution results. 🧵👇 (1/n)
As one of Drago's current PhD students, I am still in shock and disbelief. Drago means so much more to me than the word "advisor" could ever entail. He is a great mentor, a good friend, and one of the kindest and most down-to-earth people I've ever known 1/
How good are current LLMs at translating natural language into executable code?
Introducing L2CEval, where we benchmark language-to-code (L2C) generation abilities of 54 models from 12 orgs, testing on 7 tasks from 3 core domains.
Here is what we found in this first release of
Late advertisement, but I'm giving a talk
@MIT_CSAIL
on Program Synthesis w/ LLM this afternoon at 4PM EST. This talk is open to everyone so feel free to join in person or over zoom! More info (w/ zoom link) here:
While the same NL spec can often be satisfied by different programs, most datasets provide only one for learning. This can easily lead to overfitting, as the figure below shows.
In our new
#ICLR2023
paper, we show how we can mitigate this issue with self-sampling 🧵 1/7
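Roughly, the idea is to let the model generate additional correct programs for itself to learn from. A minimal sketch in Python (simplified, and the helper names below are made up, not from the paper):

def self_sample(sample_fn, gold_program, check_fn, n_samples=32):
    """Collect extra training targets: self-sampled programs that pass the checks."""
    extra = []
    for program in sample_fn(n_samples):
        if program != gold_program and check_fn(program):
            extra.append(program)
    return [gold_program] + extra  # several correct targets instead of just one

# Toy usage with stand-in functions:
def check(program):
    ns = {}
    exec(program, ns)            # run the candidate program
    return ns["f"](3) == 9       # verify against a test case

sample = lambda n: ["def f(x): return x ** 2", "def f(x): return x + x"]
gold = "def f(x): return x * x"
print(self_sample(sample, gold, check))  # keeps the gold program plus the x**2 variant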
HumanEval & MBPP are top datasets in evaluating LLMs for code. Despite common suspicion of contamination, quantifying it is hard, as it would require a massive pairwise comparison between the examples in these datasets and the pretraining corpus - and we've done exactly this 🧵👇
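As a rough intuition (not necessarily the exact method in our paper), a surface-level check could flag a benchmark example that shares a long n-gram with any pretraining document; the n=10 threshold below is an arbitrary illustration:

def ngrams(text, n=10):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(example, corpus_docs, n=10):
    # Flag the example if any long n-gram also appears in a pretraining document.
    ex = ngrams(example, n)
    return any(ex & ngrams(doc, n) for doc in corpus_docs)

example = "def add(a, b): return a + b  # add two numbers and return the resulting sum"
corpus = ["... some web page ... def add(a, b): return a + b  # add two numbers and return the resulting sum ..."]
print(is_contaminated(example, corpus))  # True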
This
#ICLR2023
is my first in-person conference post-covid and it's the most rewarding experience ever. I reunited with my friends, collaborators, mentors… some of whom I haven't seen in years and some I actually met in person for the first time. 1/n
This is my last week as an intern
@GoogleDeepMind
. The past summer was nothing but inspiring. Thanks to everyone on the learning for code team for making me feel so welcome since day one. Next, I'll be on the job market later this year, stay tuned!
Seems like a good time to share that I am joining Google Brain (now Google DeepMind) as a research intern this summer! I will be working on code generation + LLM with
@pengchengyin
in
@RandomlyWalking
's team.
The phenomenal teams from Google Research's Brain and
@DeepMind
have made many of the seminal research advances that underpin modern AI, from Deep RL to Transformers. Now we're joining forces as a single unit, Google DeepMind, which I'm thrilled to lead!
So…
In 2021, Codex was out during the 1st month of my internship
@MSFTResearch
;
In 2022, OPT was released right before my internship
@MetaAI
;
Now in 2023, PaLM 2 is out 3 weeks before my internship
@DeepMind
I sure know how to appear in the right place at the right time.
Make sure to check out the technical report for fun examples, details about building the model, and more:
#GoogleIO
#GoogleIO2023
😭 Am also tearing up a little. So dang proud of this awesome team and excited to continue this work at Google DeepMind:
Switching from one library to another (e.g., tf->torch) and tired of the manual refactoring that needs to be done?
We aim to solve this problem with our new
@ICSEconf
#icse21
paper: "SOAR: A Synthesis Approach for Data Science API Refactoring".
[1/4]
Hey cool people at
#ICML2023
: We will present our poster tomorrow (Thu) from 1:30PM to whenever it takes! Come and chat with me,
@VictoriaLinML
and
@sidawxyz
if you're interested in code generation, LLMs or training verifiers!
With
#ACL2023NLP
wrapping up, it's time to warm up for
#ICML2023
!
Check out the online demo for our ICML paper, "LEVER: Learning to Verify Language-to-Code Generation using Execution", now available on 🤗 Spaces! We also release code, model weights, and more in 🧵👇:
Btw, if you haven't, check out this great course by
@armancohan
on AI foundation models, which covers a wide range of topics about LLMs (e.g., PET, RAG, etc). All course materials (slides, notes, hws, code) are publicly available.
course website:
When I applied to PhD programs a few years ago, I got rejected from 14/15 schools I applied to, and Drago is the only one who accepted me. And it turns out I was not the only one in our lab. He always sees the best in the students and offers opportunities whenever he can 2/
Been trying out
#ChatGPT
today and honestly I am not very impressed. Many known issues for GPT-3 still remain. Here is my favorite failure case, where it shows no logic in its reasoning. More in the 🧵 below (1/7):
Hey cool people at
#ICLR2023
! We are presenting this work in the poster room at station
#26
today (5.3) from 11:30AM to whenever it takes! Come and talk with us about the paper, program synthesis, LLM and more!
More info:
I was deeply touched by the number of people sharing their stories with Drago, realizing that his influence reached far beyond my imagination, which is why I gathered the courage to share mine. As his PhD student, I will continue his research and, more importantly, spread his kindness 7/
Okay, the hack is to fold the cover of your iPad and insert it into the crack on top of the seat back screen, and wirelessly connect to your Mac using sidecar, works like 82% of the time.
In multi-doc and open-domain QA, the correct answer can often be derived from different sources of evidence, but typically only one is annotated as gold. How does this affect the training of retrieval and reasoning models?
Check out our new
#EMNLP2021
paper (a thread): 👇
The first thing I do whenever a new "SoTA" model is released is to test its ability to reason about program execution.
Unfortunately, Claude 3 Opus (large) can't even reason about the simplest program. But GPT-4 does this quite well.
Claude (left) vs. GPT (right)
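For the curious, the probe is as simple as this kind of thing (a toy example, not the exact prompt I used):

probe = """What does this program print? Reason step by step, then give the final output.

    x = [1, 2, 3]
    y = x
    y.append(4)
    print(len(x), sum(y))
"""
# Expected answer: "4 10", since x and y alias the same list.
# Send `probe` to the model under test and compare its final answer against "4 10".
print(probe)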
He offered me this opportunity and believed in me. And I have been working hard ever since, trying to repay his trust and prove that he did not place his bet wrong. And as I have only 1 yr left in my PhD, it really pains me to think that I won't see him at the finish line 3/
It pains me to see the papers he sent in my inbox just days ago. It also pains me to think the paper I am writing will be the last paper I coauthor with him. However, I think his greatest legacy is not the papers he published, but the people he influenced throughout the years 6/
During our meetings, I would talk to Drago about anything, from research to career advice, from soccer to rock bands. He would show me videos of him working as a translator at the 1994 World Cup, and I would share videos of me playing rock guitar solos 4/
So you can see your fellow reviewers' names for ICLR this year? This must have taken those self-citers for "missing references" by surprise lol. Amazed by the number of people who actually do that.
L2CEval is very much ongoing work, but I simply can't keep these results to myself any longer.
Behind each number is a jsonl file that saves all the output tokens and logits, and we are doing more digging as we speak 🤩 So let us know if you think of something interesting!
Excited to finally share our new summarization toolkit, SummerTime, which is accepted to
#EMNLP2021
Demos!
GitHub (100+ ⭐️):
We built this library specifically for non-expert users, and it comes with several merits (a thread 🧵):
"Whenever" ended up being 3:45PM. Thanks everyone who stopped by our poster yesterday! If you missed it but still would like to know more about this work, feel free to DM me!
The AI safety I'm worried about:
* Self-driving car crashes
* Robot loses grip of a knife while cooking
* AI writes a bug in rocket-launching software
* False-negative diagnosis of diseases
The AI safety I'm not worried about:
* A language model going rogue and plotting against me
#AAAI2020
I will be giving a 20min oral presentation of our work
#8812
"Merging Weak and Active Supervision for Semantic Parsing" on the 12th (Wed), 15:50, at the Sutton South room. Joint work with
@pengchengyin
and
@gneubig
. Paper link:
I had the pleasure of working with Graham during my CMU days. Btw, I had zero NLP experience when we got started, so you know Graham means it when he says "all backgrounds are welcomed" :)
Next year I will be looking for 1-2 PhD students who are interested in doing deep and impactful work on NLP! (areas are open, but I like multilingual NLP/compling, natural language interfaces, ML for NLP)
Please apply below and mention me in your app: 1/2
If you missed this one, I am going to talk about it again in an online seminar
@hkust
on Monday at 9AM HKT. Thanks
@shenjiasi
for the invite! More info (w/ zoom link) here:
I thought they were just gonna start charging, but discontinuing the API w/ a 3-day notice shows that they have no respect for the research community whatsoever…
OAI will discontinue support for Codex models starting March 26. And just like that, all papers and ideas built atop Codex (> 200 on arXiv) will not be replicable or usable as is. Do we still think openness doesn't matter?
My last email from Drago was him saying sorry about missing my talk on Sunday as he needed to get his daughter ready for bed. And when I got the chance to reply, I said "no worries, I will go out on a limb and say I did a great job :)". Little did I know he never got this msg 5/
I wasn't really buying the whole "multi-modal LLM is the future" thing till I used GPT-4V. This is mind-blowing, can't imagine how many use cases are out there.
As an ordinary PhD student studying NLP, I have mixed feelings about GPT-4. It is certainly disheartening, as it makes me question the worth of my own research. But the thrill is too overwhelming 😀
Just wrote a recommendation letter for the first time (in support of a tenure case, as a student). This feeling of being able to support someone who helped me tremendously in the past is truly great.
To help LLMs better understand program execution traces, we propose an inline trace representation, which encodes execution states as updated variable values within inline comments. We also add ordinal numbers "(0) ..." to denote execution order [2/n]
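To make this concrete, here is roughly what an inline trace looks like (an illustrative example I made up, not one taken from the paper):

def count_evens(nums):    # (0) nums = [3, 4]
    count = 0             # (1) count = 0
    for n in nums:        # (2) n = 3        (4) n = 4
        if n % 2 == 0:    # (3) -> False     (5) -> True
            count += 1    #                  (6) count = 1
    return count          # (7) returns 1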
We conducted experiments on 4️⃣ NL2Code tasks and 3️⃣ code LLMs. Results show that LEVER consistently improves performance across different LLMs and datasets, while achieving new SOTA results on all of them using code-davinci-002. (3/n)
@agihippo
Like "we instruction tuned on xx dataset" and got massive 20% improvements and beats all other OS models.
Then after tracking down appendix/citation you realize "xx" is in-domain data generated by GPT-4
Wow, we had ~20 attendees in person and 40+ more on zoom. Thanks everyone for dropping by! Also thanks a lot for hosting me,
@minimario1729
and Armando! Link to the recordings: (start from minute 27)
@huybery
I think it's easier for people with 1,000 citations to go from 0->5k followers on Twitter than for people with 5k followers to go from 0->1k citations
@lvwerra
This is awesome! I am wondering if you've tested how much GPU RAM it is able to use? Since the CPU and GPU share the same RAM, it would be wonderful if it's actually able to take advantage of the whole memory.
we are starting our rollout of ChatGPT plugins.
you can install plugins to help with a wide variety of tasks. we are excited to see what developers create!
LEVER is trained to verify the correctness of a program based on the NL input, the program itself, and its execution results. The verification probability is then combined with the generation probability to rerank the program candidates sampled from the LLMs (2/n)
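In code, the reranking step boils down to something like this (a simplified sketch; the exact scoring in the paper may differ):

import math

def rerank(candidates):
    # candidates: list of (program, generation_logprob, verifier_prob) tuples.
    # Rank by the sum of the generation log-prob and the verifier's log-prob.
    scored = [(gen_lp + math.log(max(v_p, 1e-9)), prog)
              for prog, gen_lp, v_p in candidates]
    return [prog for _, prog in sorted(scored, reverse=True)]

# Toy usage: the verifier demotes a likely-but-wrong candidate.
cands = [("df.sort('age')", -1.2, 0.10),
         ("df.sort_values('age')", -2.0, 0.95)]
print(rerank(cands)[0])  # -> df.sort_values('age')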
May have found the shortest text to make DALL-E fail: "keyboard". Ablations: 1) "computer keyboard" also fails; 2) adding "English" does not help. More suggestions are welcome!
To me, the key to
#ChatGPT
's wild popularity is not its technological innovation but 1) adopting the conversational format; 2) having an open web demo. This makes it possible for everyone to try it in just a dialogue box.
1/ In 2021, we shared next-gen language + conversation capabilities powered by our Language Model for Dialogue Applications (LaMDA). Coming soon: Bard, a new experimental conversational
#GoogleAI
service powered by LaMDA.
As an iterative self-training method, NExT first bootstraps a set of high-quality chain-of-thought rationales by naturalizing execution traces into execution-aware CoT rationales written in NL. Then we finetune LLMs on the rationales that lead to correct code outputs [3/n]
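One round of this, heavily simplified (the function names below are placeholders, not the paper's actual setup):

def next_round(model, problems, sample_fn, run_tests, finetune_fn, k=16):
    # Sample k (rationale, fixed_code) pairs per problem, conditioned on the
    # naturalized execution trace of the buggy program; keep only the samples
    # whose fixed program passes the tests, then fine-tune on the survivors.
    kept = []
    for prob in problems:
        for rationale, fixed_code in sample_fn(model, prob["code"], prob["trace"], k):
            if run_tests(fixed_code, prob["tests"]):
                kept.append((prob["code"], prob["trace"], rationale, fixed_code))
    return finetune_fn(model, kept)  # the fine-tuned model for the next iteration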
Tracking the training process, we found that reasoning with execution traces is crucial for the success of NExT. We also found that learning to reason in natural language not only provides interpretability, but also improves generalization and sample diversity [7/n]
My collaborators would always tell me to remove the comments from the LaTeX source before submitting to arXiv, and I was like "who would be bored enough to dig up LaTeX comments", and here we go.
You might know that MSFT has released a 154-page paper () on
#OpenAI
#GPT4
, but did you know they also commented out many parts from the original version?
🧵: A thread of hidden information from their LaTeX source code
[1/n]
I would typically write long and thorough reviews, but it's just so frustrating to see some random error thrown and 2+ hrs of effort gone.
@openreviewnet
isn't bad, just saying,
@aclmeeting
. At least it has auto-save.
Sending these tweets on my way to Kigali 🇷🇼! Hope to see everyone there! Feel free to DM me if you'd like to chat about program synthesis, LLM + Code, neuro-symbolic methods, and many more! 7/7
We experiment with two program repair (debugging) datasets, MBPP-R and HumanEvalFix+, which are MBPP and HE+ re-purposed for program repair. On MBPP-R, NExT improves the program fix rate of the PaLM 2-L model by 26.1%, and it also yields large improvements on HeFix+ [4/n]
My friend Drago just passed away. He left behind his wife, Axinia, and daughters, Laura and Victoria, who has a disability and requires care. We set up a GoFundMe so that Axinia can provide Victoria with the care she needs. If you can, please contribute:
💯💯💯 "AI is software" will also help people understand that AI has vulnerabilities and may malfunction just like any software, and we don't have to prove a piece of software to be bug-free before deploying it.
@ylecun
I sometimes wonder if saying AI is software would help in these contexts. Most people nowadays know what software is, and it's also true that the existence of open-source software has not caused any specific harm (afaik). On the contrary, it has helped a lot with scientific progress.
A group of current, and former, OpenAI employees - some of them anonymous - along with Yoshua Bengio, Geoffrey Hinton, and Stuart Russell have released an open letter this morning entitled 'A Right to Warn about Advanced Artificial Intelligence'.
"This offers a partial answer to the long-standing question: does code training improve reasoning abilities? We believe it does, at least for mathematical reasoning."
DeepSeekMath: Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model.
Highlights:
- Continue pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math tokens from Common Crawl.
- Introduce GRPO, a variant of PPO, that enhances mathematical reasoning and reduces
Just when I thought I couldn't be more excited about my internship at
@MetaAI
this coming summer :) Hope we can get an open-source large code LM like Codex soon!
Today Meta AI is sharing OPT-175B, the first 175-billion-parameter language model to be made available to the broader AI research community. OPT-175B can generate creative text on a vast range of topics. Learn more & request access:
Additional studies show that the learning of LEVER is data-efficient and the learned knowledge is transferable across different LLMs for the same task. (5/n)
@TaliaRinger
@DynamicWebPaige
An author here 🙋 We actually debated a lot on whether to use the word "verify", as (formal) verification indeed means very different things in PL. But in the end we felt we needed to be consistent with prior works on "verifiers" like
Lastly, I want to thank everyone that I've shared a meal, a coffee/drink, or even just a conversation with. I'm like 96.2% sure that I'm the only one from Yale that's attending in person this year. Thanks for keeping me company and making me feel I'm not alone. 4/n, n=4
Has anyone tested DeepSeek-Coder on APPS? Was reviewing a paper and saw DS-Coder *6.7B* is 12% better than GPT-4 on APPS?? But according to DS's paper it's definitely worse than GPT-4 on other benchmarks. Or is it an open secret that DS-Coder is SFT-ed on APPS?
print("Hello world! ๐")
Excited to announce the BigCode project led by
@ServiceNowRSRCH
and
@huggingface
! In the spirit of BigScience we aim to develop large language models for code in an open and responsible way.
Join here:
A thread with our goals 🧵
More results for the InCoder and CodeGen models are shown here. Ablation studies show that execution info is essential to the success of LEVER, and it works seamlessly in weakly-supervised settings without a large performance drop. (4/n)
The risk of taking "low-hanging fruit" in AI4Code research is no longer just getting scooped by other researchers, but also by companies' new products 🤦 So we gotta dream BIG now.
Microsoft is releasing GitHub Copilot X
It includes:
• AI-generated answers from code docs
• Chat interface for code suggestions
• Copilot for the command line
• Voice interface with Copilot
• Copilot for pull requests
Okay, NOW it's so over.