Really excited about the launch of this research initiative. Hiring Research Scientists now; Research Software Engineers and postdocs over the next few months. 300 H100 GPUs. Multidisciplinary teams. Princeton helps keep AI expertise in the open sphere. More:
“The dramatic rise of AI capabilities…is a watershed event for humanity…It is also sure to transform research and teaching in every academic discipline.” – @prfsanjeevarora, director of the new @Princeton Language and Intelligence initiative. For more:
Conventional wisdom: "Not enough data? Use classic learners (Random Forests, RBF SVMs, ...), not deep nets." New paper: infinitely wide nets beat these and also beat finite nets. Infinite nets train faster than finite nets here (hint: Neural Tangent Kernel)!
"Is optimization the right language to understand the brain?" is a famous controversy in neuroscience. My new blog post asks whether optimization is the right language even to understand deep learning. (TL;DR: let's think trajectories!)
Princeton has a new Center for Language and Intelligence, researching LLMs + large AI models, as well as their interdisciplinary applications. Looking for postdocs/research scientists/engineers; attractive conditions.
Conventional wisdom: slowly decay the learning rate (lr) when training deep nets. Empirically, some exotic lr schedules also work, e.g., cosine. New work with Zhiyuan Li: an exponentially increasing lr works too! Experiments + surprising math explanation. See
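A minimal sketch of what such a schedule looks like; the function name and constants are illustrative, not taken from the paper:

```python
# Hypothetical sketch of an exponentially *increasing* learning-rate schedule.
# base_lr and growth are made-up constants for illustration.
def exp_increasing_lr(step, base_lr=0.1, growth=1.0005):
    """Learning rate grows geometrically in the step count."""
    return base_lr * growth ** step

# With batch norm + weight decay, growing weight norms shrink the *effective*
# step size, which is one intuition for why such a schedule can still train.
lrs = [exp_increasing_lr(t) for t in (0, 1000, 2000)]
```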
Blog post on our new theory for word2vec-like representation learning methods for images, text, etc. Explains why representations do well on previously unseen classification tasks. Relevant to meta-learning, transfer learning? Paper
Workshop: "Theory of Deep Learning: Where Next?" at the Institute for Advanced Study, Tuesday--Friday this week. Amazing schedule of talks!
Registration is closed (sorry), but follow livestream here
Big congratulations to Avi Wigderson of IAS Princeton for winning the Turing Award in CS. Truly an all-time great in theoretical computer science and discrete math. Also one of the nicest human beings I know -- friend and mentor to so many (including me).
Our long-delayed blog post on the ICLR20 paper showing that current deep nets can be trained with a learning rate that is exponentially increasing. Not just experiments but also a mathematical proof that this is at least as powerful as usual LR tuning.
How do you compute with an infinitely wide deep net (e.g., AlexNet or VGG with width taken to infinity)? Despite crazy overparametrization, this net works OK on the finite dataset CIFAR10. To understand how this was done (via "Neural Tangent Kernels") see
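For intuition: prediction with an infinitely wide net reduces to kernel regression with the Neural Tangent Kernel. A sketch with a stand-in RBF kernel where the architecture-specific NTK would go (all names are illustrative):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Stand-in kernel; the real NTK has its own closed form per architecture.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_predict(X_train, y_train, X_test, kernel=rbf_kernel, reg=1e-6):
    # Kernel regression: f(x) = K(x, X) (K(X, X) + reg*I)^{-1} y.
    K_tt = kernel(X_train, X_train) + reg * np.eye(len(X_train))
    return kernel(X_test, X_train) @ np.linalg.solve(K_tt, y_train)
```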
Deep-learning-free text embeddings. Surprisingly simple text embeddings suffice to match the performance of much more sophisticated methods for capturing the meaning of text.
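In the spirit of SIF ("smooth inverse frequency") embeddings, a sentence vector is just a frequency-weighted average of word vectors (optionally followed by removing the top principal component across sentences). A toy sketch; the vocabulary and probabilities below are made up:

```python
import numpy as np

def sif_embedding(sentence, word_vecs, word_prob, a=1e-3):
    # Weighted average: rarer words (low corpus probability) get more weight.
    vecs = [(a / (a + word_prob[w])) * word_vecs[w]
            for w in sentence.split() if w in word_vecs]
    return np.mean(vecs, axis=0)
```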
Contrastive learning gives great data representations. New paper (title is an homage to Zhang et al. '16) says understanding it requires opening the black box of deep learning.
(Note: Lead author Nikunj Saunshi is on the job market.)
We're looking for postdoctoral fellows in AI! We offer: an excellent cohort of young researchers, a dedicated GPU cluster with 300 H100s, $100K salary (+$10K research funds), stunning campus. 1 hour from NYC and Philly. Renewable, i.e., possible to stay multiple years. Join us!
Excited to announce the Princeton Language and Intelligence Postdoctoral Research Fellowship!
Candidates are encouraged to apply by the start-of-review date, Friday, December 1, 11:59 pm (EST), for full consideration.
Details:
We're hiring Research Engineers and Research Scientists now, and postdocs in the winter. Please join us in developing AI as well as applying it to academic disciplines, including the humanities, social sciences, and the sciences.
With sparse coding again popular for interpretability in LLMs, please look at older work! "Latent structure in word embeddings", "Atoms of meaning", decoding brain fMRI via sentence embeddings
Fine tuned LLMs can solve many NLP tasks. A priori, fine-tuning a huge LM on a few datapoints could lead to catastrophic overfitting. So why doesn’t it? Our theory + experiments (on GLUE) reveal that fine-tuning is often well-approximated as simple kernel-based learning. 1/2
New blog post by Nadav Cohen. If we want to understand deep learning, we have to start analysing the trajectory of gradient descent rather than the landscape. The paper is here
Matching AlexNet performance (89%) on CIFAR10 using a kernel method. Excluding deep nets, the previous best was 86% (Mairal, NIPS'16). Key ideas: convolutional NTK + Coates-Ng random patches layer + a way to fold data augmentation into the kernel definition
Hoping to read the new papers by Allen-Zhu et al. Training provably converges on greatly overparametrized deep nets. And such overparametrized deep nets can generalize when trained on data from a teacher net.
Shiller's advice is good in any field. Easy but sad explanation for why young people often ignore this advice: the (N+1)th result in a field with N results is difficult to obtain, hence easy to publish. The 1st or 2nd results in a field are easier to obtain, but harder to publish.
Remember matrix completion? Deep linear nets solve it better than the old nuclear norm algorithm. Analysis requires going beyond the traditional optimization view and understanding #trajectories. Blog post by Nadav and Wei. Paper
New mathematical explanation of the lack of barriers in the deep learning landscape (i.e., low-cost solutions interconnected via regions of low cost; ICML18). Applies to realistic deep nets and uses the noise stability property. Rong Ge's blog post about our paper
Has deep learning overfitted to the test sets of popular datasets? Move over, Occam! Rip Van Winkle's Razor gives nontrivial upper bounds on the amount of overfitting for popular architectures. Blog post + article with Yi Zhang
Saliency maps give “human interpretability” to deep learning. NIPS18 paper (@mrtz, @goodfellow_ian, @_beenkim) showed they fail “sanity checks” involving model and data randomization. We fix saliency maps to pass sanity checks ("Competition for pixels")
Day-long event at the Institute for Advanced Study on Fri Feb 22. Deep Learning: Alchemy or Science? Speakers: Mike Collins, @ylecun, @zacharylipton, Joelle Pineau, Shai Shalev-Shwartz. Will be livestreamed. Panel will respond to questions from the worldwide audience via Twitter.
Giving three talks for the ETH Zurich Paul Bernays Lectures 2022: "The quest for mathematical understanding of artificial intelligence." This week's two talks are accessible to non-experts.
Blog post on new mismatches between current theories of optimization and modern deep learning. Tiny learning rates don't hurt generalization. Surprising insight about fast mixing in the landscape and what it means. New theory with @zhiyuanli_ and @vfleaking.
Simons Foundation and NSF propose to spend $20M to fund projects on Mathematical and Scientific Foundations of Deep Learning
An interesting public-private partnership to fund basic research.
2nd article on deep-learning-free text embeddings that are simple and fast to implement, and compete quite well with far more complicated embeddings.
My new paper (joint with Nadav Cohen and Elad Hazan) on the benefits of overparametrization is up. I recommend Nadav's nice blog post as a starting point:
Visited the new @GoogleAI lab in Palmer Square, Princeton and enjoyed the excellent coffee with my colleague (and lab co-director) @HazanPrinceton. Exciting times for machine learning and AI in Princeton NJ!
Theory of Deep Learning: Where Next? Workshop @the_IAS, Princeton, Oct 15-18, 2019. Great speaker lineup! Registration open. Contributed paper/talk/poster submission deadline: Sept 2.
Blog returns from summer. New article by Simon Du and Wei Hu on Neural Tangent Kernels (which capture the power of infinitely wide nets trained on finite datasets). Watch out for more in coming weeks!
Computing Convolutional Neural Tangent Kernels (CNTKs) for 20-layer nets with a pooling layer is computationally expensive, and many people wrote to us wondering how it is feasible. Short answer: these students not only have great theory chops, but can also write CUDA!
We have released code for computing Convolutional Neural Tangent Kernel (CNTK) used in our paper "On Exact Computation with an Infinitely Wide Neural Net", which will appear in NeurIPS 2019.
Paper:
Code:
Seminar series in theoretical ML is continuing online this summer at @the_IAS. Upcoming speakers: @mraginsky (today at 12:20pm!), Mike Jordan, Shankar Sastry, etc. Registration required.
Congratulations to @StanfordAILab Director @chrmanning, awarded the 2024 IEEE John von Neumann Medal, one of @IEEEAwards's top awards, “for outstanding achievements in computer-related science and technology”, for his advances in #NLProc.
Princeton Language and Intelligence Initiative looking for Research Scientists (PhD required). Foci: (i) foundation models, LLMs; (ii) applications of models to other disciplines; (iii) understanding effects on society and mitigating harms. Let's chat at ICML23?
#IASMLyear Special year in machine learning, optimization and statistics, 2019-20, at the Institute for Advanced Study. Visit with stipend for a term or a year; shorter visits possible for industry folks. Apply by Dec 1
How do you induce embeddings for a word from a single or a few occurrences? A simple method that also improves unsupervised sentence embeddings: à la carte embeddings.
Also how different meanings of a word reside inside its embedding (TACL)
Five amazing expositions of zero-knowledge proofs by Amit Sahai of UCLA, aimed at five very different types of listeners. Heartening to see a mathy video rack up millions of views in a few weeks.
Efficient Covid testing: how to test more patients with a fixed number of test kits. Cool applications of math concepts we teach our students: coding theory, compressed sensing, etc.
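One such idea, sketched under toy assumptions: pool samples, then decode COMP-style (anyone who appears in a negative pool is certified negative). The pooling design below is illustrative, not a specific published protocol.

```python
def run_pools(samples, pools):
    # A pooled test is positive iff the pool contains at least one positive.
    return [any(samples[i] for i in pool) for pool in pools]

def comp_decode(n, pools, results):
    # COMP decoding: membership in any negative pool certifies a negative;
    # everyone else remains a suspected positive.
    cleared = set()
    for pool, positive in zip(pools, results):
        if not positive:
            cleared.update(pool)
    return [i for i in range(n) if i not in cleared]

# 6 patients, 1 infected (patient 2), using only 5 pooled tests instead of 6.
samples = [False, False, True, False, False, False]
pools = [[0, 1, 2], [2, 3], [4, 5], [0, 4], [1, 3]]
suspected = comp_decode(6, pools, run_pools(samples, pools))
```

With k infected among n patients, roughly O(k log n) pooled tests suffice, which is the compressed-sensing flavor of the savings.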
New blog post describes our new paper (with Rong Ge, Behnam Neyshabur, Yi Zhang) making progress on the generalization mystery of deep nets. The bounds are orders of magnitude better than recent papers.
The boundary between trainable and untrainable neural network hyperparameter configurations is *fractal*! And beautiful!
Here is a grid search over a different pair of hyperparameters -- this time learning rate and the mean of the parameter initialization distribution.
Very excited about this paper and its implications. Turing-completeness of transformers implies they can simulate other models inside them. But it's nontrivial that a net can do gradient updates on another net inside it which is 1/8th its size. Great work by the student team!
**New paper**
In-context learning has been explained as simulating + training simple models at inference. We show a 2B model can run GD on an internal 125M model. Surprising simulation + AI safety implications! 1/5
w/ @SadhikaMalladi, @xiamengzhou, @prfsanjeevarora
Panel discussion at 4:30pm in the IAS workshop "Theory of Deep Learning: Where Next?" Panelists include @ylecun, @chrmanning, Srebro, Bottou, Collins, Kakade, etc. Please tweet your questions for the panel in response to this.
Postdoc positions in theoretical machine learning at Princeton CS Dept. Relevant faculty include Elad Hazan, Ryan Adams, Yoram Singer, and me. Mention in cover letter which faculty you are interested in. Best to apply by Dec 15; latest by Jan 10.
Good to see the leader in this week's Economist about large language models. Covers many of the issues being discussed in AI/ML, including the nature of "intelligence", huge training cost (and "rich getting richer"), scaling phenomena, geopolitics.
“Foundation models” represent a breakthrough in artificial intelligence or AI. They are a new form of creative, non-human intelligence and promise to bring great benefits
My talk at @mitidss on theory for contrastive unsupervised representation learning (word2vec-like methods popular for learning embeddings of images, text, molecules, etc.). Paper (with an amazing student group) is here. Blog post soon!
I was quite curious what OpenAI's preparedness unit is working on, and @aleks_madry gave a good high-level view in our Princeton Alignment and Safety seminar. Kudos to @SadhikaMalladi and @YangsiboHuang for interesting followup Q&A
Article on (i) theory of emergence of complex skills in LLMs, (ii) SKILL-MIX eval, which shows LLMs are able to use skill combos not seen during training. @QuantaMagazine's thoroughness and quality are exemplary! Quotes @geoffreyhinton. Video of related talk
“Stochastic parrots” generate text only by combining information they have already seen, not through any understanding of their own. Are ChatGPT, Bard and other large chatbots simply parroting their training data? The answer is probably no.
Nontrivial generalization bounds on deep nets are tough. The PGDL competition (NeurIPS20) promoted empirical study of predictors of generalization error. Our ICLR22 spotlight aced the PGDL testbed. Idea: estimate with synthetic data from GANs trained on the training data
Speaking tomorrow (Friday) 2pm in the @icmlconf workshop on Theoretical Physics in Deep Learning. Title: "Is Optimization a Sufficient Language to Understand Deep Learning?" (Also, grad student Orestis speaking Thurs 4pm about our work on word2vec-like methods for representation learning.)
Congratulations to @tengyuma for an honorable mention in the 2018 ACM Doctoral Dissertation Award! Congrats also to @chelseabfinn and Ryan Beckett. Tengyu and Ryan were both Princeton grad students!
Fantastic popular lecture by Stanford's Chris Manning on Natural Language Processing and Deep Learning. Best popular introduction I know of to the mysteries of language and how to teach machines to understand them.
Very interesting papers from @ZeyuanAllenZhu, and this trick in particular. I recall hearing evidence that OpenAI does label training data with source/provenance (the LLM sometimes spits out those memorized labels). Can't remember where/whom I learnt this from
Result 10/11/12: surprisingly, when pre-training on good data (e.g., Wiki) together with "junk" (e.g., Common Crawl), the LLM's capacity on good data may decrease by 20x! A simple fix: add domain tokens to your data; LLMs can auto-detect domains rich in knowledge and prioritize them.
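The fix is essentially a one-liner at data-preparation time; the tag format below is made up for illustration:

```python
def tag_with_domain(doc: str, domain: str) -> str:
    # Prepend a provenance token so the model can tell, e.g., "wikipedia"
    # text apart from noisy "web" text during pre-training.
    return f"<|domain:{domain}|> {doc}"

tagged = tag_with_domain("Paris is the capital of France.", "wikipedia")
```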
Yoshua Bengio, Geoffrey Hinton and Yann LeCun, the fathers of #DeepLearning, receive the 2018 #ACMTuringAward for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing today.
Cohen et al. 2021 showed that gradient descent in deep nets doesn't operate according to traditional optimization: it operates beyond the "Edge of Stability." New paper with @zhiyuanli_ and @Abhishek_034 analyses GD beyond EoS and shows a sharpness-reduction benefit.
Proof of convergence to a global optimum for gradient descent on linear neural networks (joint w/ @prfsanjeevarora, @Hoooway, Noah Golowich) --- check it out tomorrow in the #NeurIPS2018 DL theory workshop poster session (220D, 3PM)!
Research Software Engineer positions in AI! Enable core AI research & interdisciplinary applications at Princeton. SoTA GPU cluster with 300 Nvidia H100s. Attractive and collaborative work environment. Positions based in Princeton (but flexible work setup), starting ASAP.
Launching the blog @PrincetonPLI with a post on SKILL-MIX. LLMs aren't just "stochastic parrots." @geoffreyhinton recently mentioned this as evidence that LLMs do "understand" the world a fair bit. More blog posts on the way! (Hinton's post here: )
Our paper on provably efficient algorithms for topic modeling finally appeared in CACM.
Many people use these methods instead of older EM or MCMC approaches.
Excited about this new work from our group. Local SGD will be increasingly important as distributed training strategies (with asynchronous updates) allow more flexible training of large AI models. Great theory and experiments, kudos to @hmgxr128 and @vfleaking!
Local SGD, though designed to reduce communication, can generalize better than SGD! Our #ICLR2023 paper gives the first theoretical explanation of this phenomenon: local steps inject extra noise, driving the iterate to drift faster to flatter minima on the minimizer manifold. 1/4
Fine-tuning language models using just forward pass! Our paper should interest you if you have enough GPU memory to evaluate your model but not enough for efficient backpropagation. Zeroth order optimization is an old idea but there are subtleties and tricks in making this work!
Introducing MeZO - a memory-efficient zeroth-order optimizer that can fine-tune large language models with forward passes while remaining performant. MeZO can train a 30B model on 1x 80GB A100 GPU.
Paper:
Code:
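The core idea behind MeZO-style zeroth-order training, sketched on a toy quadratic (function names and constants are mine, not the paper's): estimate the directional derivative from two forward passes along a shared random direction, so no backward pass or gradient memory is needed.

```python
import random

def zo_step(theta, loss_fn, lr=0.01, eps=1e-3, rng=None):
    rng = rng or random.Random(0)
    z = [rng.gauss(0, 1) for _ in theta]                 # random direction
    lp = loss_fn([t + eps * zi for t, zi in zip(theta, z)])
    lm = loss_fn([t - eps * zi for t, zi in zip(theta, z)])
    g = (lp - lm) / (2 * eps)                            # directional derivative
    # Move along z, scaled by the estimated derivative: an SPSA-style update.
    return [t - lr * g * zi for t, zi in zip(theta, z)]

loss = lambda th: sum((t - 1.0) ** 2 for t in th)
theta, rng = [0.0, 0.0], random.Random(0)
for _ in range(200):
    theta = zo_step(theta, loss, rng=rng)
```

In MeZO the perturbation is regenerated from a saved RNG seed instead of being stored, which is where the memory savings come from.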
I will present my thesis defense tomorrow!
Language Agents: From Next-Token Prediction to Digital Automation
- 10am EST on Thursday, May 2
-
- WebShop, ReAct, ToT, CoALA
- Briefly: SWE-bench/agent
- Thoughts on the future of language agents
Skeptical about deep learning theory that uses continuous formulations (e.g. SDE) to reason about discrete Stochastic Gradient Descent? Don't miss this poster today.
Stochastic Differential Equations (SDEs) have been widely used to model and understand SGD; e.g., the famous Linear Scaling Rule follows directly from them.
But is this heuristic approximation really valid in deep learning practice?
paper:
🧵(1/5)
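The Linear Scaling Rule that the SDE view predicts is simple to state: keep the ratio of learning rate to batch size fixed. A one-line sketch:

```python
def linearly_scaled_lr(lr, batch_size, new_batch_size):
    # In the SDE view of SGD, gradient-noise intensity scales like
    # lr / batch_size, so scaling the batch by kappa calls for scaling
    # the learning rate by kappa as well to keep the dynamics matched.
    return lr * new_batch_size / batch_size

new_lr = linearly_scaled_lr(0.1, 256, 1024)  # 4x batch -> 4x learning rate
```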
Excited about our new work from @PrincetonPLI. Our grads never cease to amaze us. It's better to use just 5% of the instruction-tuning data (suitably selected) instead of the full dataset.
Lots of instruction tuning data out there...but how to best adapt LLMs for specific queries? Don’t use ALL of the data, use LESS! 5% beats the full dataset. Can even use one small model to select data for others!
Paper:
Code: [1/n]
Researchers @PrincetonPLI have created an autonomous AI software engineer that's free and open source.
💻 Called SWE-agent, it uses an LLM, like GPT-4, to automatically fix coding problems in GitHub.
🤯 It can solve problems in about 90 seconds with high accuracy
Encoder-decoder GAN architectures still don't fix the theoretical problems in the GAN framework, such as mode collapse. Encoders may produce nonsense codes and the discriminator is none the wiser. Blog post and ICLR'18 paper
Deepseek's new VLM is very impressive. But p. 7 mentions they trained on 1M books from "Anna's Archive", i.e., illegal downloads. That's 100B very high-quality tokens. Dark new world...
Today we're excited to introduce Devin, the first AI software engineer.
Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork.
Devin is
LLMs can exhibit unsafe behaviors after fine-tuning on perfectly benign-looking data. To avoid this, it is best to ignore recommended fine-tuning best practices (e.g., on Llama 2). TL;DR: fine-tune without the recommended safety prompt, but use the safety prompt at inference.
Fine-tuning can improve chatbots (e.g., Llama 2-Chat, GPT-3.5) on downstream tasks — but may unintentionally break their safety alignment.
Our new paper: Adding a safety prompt is enough to largely mitigate the issue, but be cautious about when to add it!
Our paper on generalization bounds for deep nets (joint with Rong Ge, Behnam Neyshabur, and Yi Zhang) is here. It uses a new approach based on direct compression. See also my blog post on
NSF funding large projects in infrastructure for computing. Deep learning (e.g., foundation models) is an obvious use. Hoping universities are looking at this. Contact me if you need Princeton as a partner.
The Chinchilla paper is one of my favorite papers of the last few years
I love that they actually came up with a law for training models. Very few papers are bold enough to make that claim & back it up with excellent experiments
In our second PLI blog post, authors @_carlosejimenez and @jyangballin describe testing LLMs on challenges that software engineers face every day. Read it here:
Proof that you don't need Olympiad golds for building towards a better Devin if you have open source.
(although at Princeton, Olympiad medalists are so commonplace that even if they exist, they don't bother to mention them)