@NVIDIA — Sr. Research Manager & Lead of Embodied AI (GEAR Lab). Creating foundation models for Humanoid Robots & Gaming. @Stanford Ph.D. @OpenAI's first intern.
Today is the beginning of our moonshot to solve embodied AGI in the physical world. I’m so excited to announce Project GR00T, our new initiative to create a general-purpose foundation model for humanoid robot learning.
The GR00T model will enable a robot to understand multimodal…
If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all…
I asked GPT-4 to take over Twitter and outsmart @elonmusk. It comes up with "Operation TweetStorm"😮 and wants to publicly challenge Elon to a "Tweet-off showdown". Highlights:
- GPT-4 wants to *own an unrestricted version of itself*: develop an LLM to power a bot army of…
The famed Stanford Smallville is officially open-source!
25 AI agents inhabit a digital Westworld, unaware that they are living in a simulation. They go to work, gossip, organize socials, make new friends, and even fall in love. Each has unique personality and backstory.…
What if we set GPT-4 free in Minecraft? ⛏️
I’m excited to announce Voyager, the first lifelong learning agent that plays Minecraft purely in-context. Voyager continuously improves itself by writing, refining, committing, and retrieving *code* from a skill library.
GPT-4 unlocks…
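The write→refine→commit→retrieve loop can be sketched in a few lines. This is a minimal, hypothetical stand-in: keyword overlap instead of the embedding-based retrieval Voyager actually uses, and Python stubs instead of its JavaScript skills:

```python
# Toy skill library in the spirit of Voyager (illustrative API, not the real one).

class SkillLibrary:
    """Stores named code skills; retrieves by keyword overlap with the
    skill description (a crude stand-in for embedding similarity search)."""

    def __init__(self):
        self.skills = {}  # name -> (description, source code)

    def commit(self, name, description, code):
        self.skills[name] = (description, code)

    def retrieve(self, query, top_k=1):
        words = set(query.lower().split())
        ranked = sorted(
            self.skills.items(),
            key=lambda kv: -len(words & set(kv[1][0].lower().split())),
        )
        return [name for name, _ in ranked[:top_k]]

lib = SkillLibrary()
lib.commit("mine_wood", "chop a tree to collect wood logs", "def mine_wood(bot): ...")
lib.commit("craft_table", "craft a crafting table from wood planks", "def craft_table(bot): ...")
print(lib.retrieve("collect wood from a tree"))  # → ['mine_wood']
```

The real agent would feed the retrieved skill's source code back into the LLM context before attempting a new task.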
My team at NVIDIA is hiring. We 🩷 you all from OpenAI. Engineers, researchers, product team, alike. Email me at linxif@nvidia.com. DM is open too. NVIDIA has warm GPUs for you on a cold winter night like this, fresh out of the oven.🩷
I do research on AI agents. Gaming+AI,…
AI Twitter is flooded with low-quality stuff recently. No, GPT is not “dethroned”. And thin wrapper apps are not “insane”. At all.
I feel obligated to surface some quality posts I bookmarked. Every one of them should've been promoted 10x, but ¯\_(ツ)_/¯
In no particular order:
10x engineer is a myth. 100x AI-powered engineer is more real than ever. As OpenAI winds down Codex, Microsoft announces GitHub Copilot X. I think it's almost as exciting as GPT-4 itself:
- Copilot Chat: any piece of text database will be "chattable", and codebase is no…
We’ve seen a gazillion startups using OpenAI APIs to do “co-pilot for X”. What’s next?
Enter *physical* co-pilot! Here’s a compelling demo: you improvise by playing a “low resolution” piano, and the co-pilot compiles it real-time to Hi-Fi music! It unleashes our inner pianist.🧵
This is a master 4D chess move. WOW.
1. No new corporate structure. MSFT is literally one of the oldest for-profit tech companies out there, with a mature legal structure. Whether it's good for AGI is up for debate.
2. MSFT always wants to own the GPT weights. Now the moment has…
We remain committed to our partnership with OpenAI and have confidence in our product roadmap, our ability to continue to innovate with everything we announced at Microsoft Ignite, and in continuing to support our customers and partners. We look forward to getting to know Emmett…
I don't give a damn about what is or isn't AGI. It doesn't matter.
Below is GPT-4's performance on many standardized exams: the Bar, LSAT, GRE, AP, etc.
The truth is, GPT-4 can apply to Stanford as a student now. AI's reasoning ability is OFF THE CHARTS. Exponential growth is the…
Can GPT-4 teach a robot hand to do pen spinning tricks better than you do?
I'm excited to announce Eureka, an open-ended agent that designs reward functions for robot dexterity at super-human level. It’s like Voyager in the space of a physics simulator API!
Eureka bridges the…
You'll soon see lots of "Llama just dethroned ChatGPT" or "OpenAI is so done" posts on Twitter. Before your timeline gets flooded, I'll share my notes:
▸ Llama-2 likely costs $20M+ to train. Meta has done an incredible service to the community by releasing the model with a…
HuggingGPT is the most interesting paper I read this week. It gets very close to the "Everything App" vision that I described a while ago.
ChatGPT acts as a controller over the *AI model space*, picks the right model (app) given the human specification, and assembles them…
Here’s the recipe to make Siri/Alexa 10x better:
1. Whisper to convert speech to text. Best open-source speech model out there.
2. ChatGPT to generate smart home API calls and/or text response.
3. VALL-E to synthesize speech. It can mimic anyone’s voice sample!
Quick figure 1/3
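The three stages chain together naturally. In this sketch every function body is a stub standing in for the real model call (Whisper, a chat LLM, VALL-E), and the parse logic is purely illustrative:

```python
# Sketch of the 3-stage voice-assistant pipeline. The bodies are stubs:
# in a real system each would call the corresponding model.

def speech_to_text(audio: bytes) -> str:
    # Whisper would transcribe here; stubbed for illustration.
    return "turn off the living room lights"

def plan_action(utterance: str) -> dict:
    # The LLM would emit a smart-home API call; stubbed as a fixed parse.
    device = "living_room_lights" if "living room lights" in utterance else "unknown"
    action = "off" if "turn off" in utterance else "on"
    return {"device": device, "action": action, "reply": f"Okay, lights {action}."}

def text_to_speech(reply: str) -> bytes:
    # VALL-E would synthesize audio here; stubbed as UTF-8 bytes.
    return reply.encode("utf-8")

call = plan_action(speech_to_text(b"\x00fake-audio"))
print(call["device"], call["action"])  # → living_room_lights off
```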
Somehow in this epic meltdown, Satya swoops in, wins it all, and wins with grace. I'm floored.
OpenAI was invincible until Friday. Now Microsoft will fully own an in-house GPT-4 in ~9 months, leverage its massive distribution power to spin the biggest data flywheel ever, collect…
Million dollar idea: LLM keyboard.
Every time I type on my phone and autocorrect makes a stupid mistake, it screams LLM. This is *literally* next word prediction.
We should be typing 10x faster. Input methods need serious upgrades. The LLM doesn’t have to be big and can be…
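The core mechanic is literal next-word prediction. A toy bigram counter (a crude stand-in for the small on-device LLM the tweet imagines) already illustrates the interface:

```python
# Minimal next-word predictor: count bigrams, suggest the most frequent follower.

from collections import Counter, defaultdict

def train_bigram(corpus: str):
    counts = defaultdict(Counter)
    words = corpus.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def suggest(counts, prev_word: str) -> str:
    prev_word = prev_word.lower()
    if prev_word not in counts:
        return ""
    return counts[prev_word].most_common(1)[0][0]

model = train_bigram("see you soon . see you tomorrow . see you soon")
print(suggest(model, "see"))  # → you
print(suggest(model, "you"))  # → soon
```

A real LLM keyboard would swap the bigram table for a small transformer, but the suggest-on-every-keystroke interface stays the same.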
NVIDIA basically compressed 30 years of its corporate memory into 13B parameters. Our greatest creations add up to 24B tokens, including chip designs, internal codebases, and engineering logs like bug reports. Let that sink in.
The model "ChipNeMo" is deployed internally, like a…
*If* GPT-4 is multimodal, we can predict with reasonable confidence what GPT-4 *might* be capable of, given Microsoft’s prior work Kosmos-1:
- Visual IQ test: yes, the ones that humans take!
- OCR-free reading comprehension: input a screenshot, scanned document, street sign, or…
How to make ChatGPT 100x better at solving math, science, and engineering problems for real?
Teach it to use the Wolfram language.
ChatGPT: the best neural reasoning engine.
Mathematica: the best symbolic reasoning engine.
I can’t think of a happier marriage. 🧵 with example:
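A sketch of that division of labor: a stubbed "neural" side translates the question into a formal expression, and a tiny exact evaluator (standing in for the Wolfram engine) computes the answer without arithmetic slips. The question-to-expression mapping is hard-coded for illustration:

```python
# Neural side proposes a formal expression; symbolic side evaluates it exactly.

import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Pow: operator.pow}

def symbolic_eval(expr: str):
    """Exact evaluation of +, -, *, ** over integers (the 'Mathematica' side)."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def llm_translate(question: str) -> str:
    # Stub for the neural side: map a question to a formal expression.
    return {"What is 2 to the power 100?": "2**100"}[question]

print(symbolic_eval(llm_translate("What is 2 to the power 100?")))
# → 1267650600228229401496703205376
```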
The music & sound-effect industry has not fully understood the size of the storm about to hit.
There's not just one, or two, but FOUR new audio models in the past week *alone*.
If 2022 is the year of pixels for generative AI, then 2023 is the year of sound waves.
Deep dive with me: 🧵
The first time I met Jensen was also the first time I met @elonmusk. I was interning at OpenAI that day and witnessed the moment Jensen handed Elon the first DGX. I slipped in my signature ;)
Elon, if you recall, I asked how "we (OpenAI) can beat DeepMind". You told me, "by…
6 years ago today, "Attention is All You Need" went up on arXiv! Happy birthday, Transformer! 🎂
Fun facts:
- Transformer did not invent attention, but pushed it to the extreme. The first attention paper was published 3 years prior (2014) and had an unassuming title: "Neural Machine…
We are looking at the future of VR, YouTube & Google Street View.
This is zip-NeRF, a 3D neural rendering tech rapidly approaching the quality of a real, high-res drone flight video. Think of NeRF as transporting reality into simulation. Metaverse will finally work this time.
The AI explosion is warping our sense of time. Can you believe Stable Diffusion is only 4 months old, and ChatGPT <4 weeks old 🤯? If you blink, you miss a whole new industry. Here are my TOP 10 AI spotlights, from a breathtaking 2022 in rewind ⏮: a long thread 🧵
Reading @MetaAI's Segment-Anything, and I believe today is one of the "GPT-3 moments" in computer vision. It has learned the *general* concept of what an "object" is, even for unknown objects, unfamiliar scenes (e.g. underwater & cell microscopy), and ambiguous cases.
I still…
We live in such strange times. Apple, a company famous for its secrecy, published a paper with a staggering amount of detail on their multimodal foundation model. Those who are supposed to be open are now wayyy less open than Apple.
MM1 is a treasure trove of analysis. They discuss…
MidJourney hired an engineer from Apple Vision Pro to be "Head of Hardware". My best guess is that they are thinking about generating full synthetic worlds for AR/VR, because of their rumored works on text-to-3D. Data-driven simulation is a hot topic at NVIDIA and very dear to my…
Enough with LLMs - exciting things are happening in the world of atoms.
This is Stanford ALOHA, a low-cost and agile robot platform. The whole system is open-source (!!): hardware design, CAD models for 3D printing, simulator, and training code. Time to …
This is an ape ("Kanzi") playing Minecraft! A fascinating experiment on non-human biological neural networks 🙉
I've been teaching AI to play Minecraft for too long. There're so many similar techniques that the ape trainers used:
- In-context reinforcement learning: Kanzi gets…
This is the way to unlock the next trillion high-quality tokens, currently frozen in textbook pixels that are not LLM-ready.
Nougat: an open-source OCR model that accurately scans books with heavy math/scientific notations. It's ages ahead of other open OCR options. Meta is…
After ChatGPT, the future belongs to multimodal LLMs. What’s even better? Open-sourcing.
Announcing Prismer, my team’s latest vision-language AI, empowered by domain-expert models in depth, surface normal, segmentation, etc.
No paywall. No forms. …
A neural network can smell like humans do for the first time!👃🏽
Digital smell is a modality that the AI community has long ignored, but maybe one day useful for a robot chef 👩🏽‍🍳? Here's how to do smell2text:
1. Collect 5,000 molecules and ask humans to label "creamy, chocolate,…
AutoGPT just exceeded PyTorch itself in GitHub stars (74k vs 65k). I see AutoGPT as a fun experiment, as the authors point out too. But nothing more. Prototypes are not meant to be production-ready. Don't let media fool you - most of the "cool demos" are heavily cherry-picked: 🧵
Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof. @yukez. GEAR stands for Generalist Embodied Agent Research.
We believe in a future where every machine that moves will be autonomous, and robots and…
Apparently people are starting to wear prosthetic fingers, so that surveillance images look like they're generated by Stable Diffusion 😅
The human race is overfitting to the quirks of our AI overlords.
Microsoft will let companies create their own ChatGPT. “BYOD”: Bring Your Own Data.
Do you get the implication? Startups that are just thin wrappers around OpenAI API may finally get their moat! I think this is even more exciting than Bing+ChatGPT.
Start collecting data now.
Chatbot UI: an MIT-licensed, community-driven clone of the ChatGPT UI.
What most people don't realize is that you can pay *much less* to enjoy the same features as the official app. $20 worth of gpt-3.5 API is about writing a full Harry Potter book every …
The Adam optimizer is at the heart of modern AI. Researchers have been trying to dethrone Adam for years.
How about we ask a machine to do a better job?
@GoogleAI uses evolution to discover a simpler & more efficient algorithm with remarkable features.
It’s just 8 lines of code: 🧵
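The discovered optimizer was named Lion, and its update rule really is tiny: interpolate the gradient into momentum, take the sign, apply decoupled weight decay. A plain-Python, single-parameter sketch (hyperparameter values here are illustrative, not the paper's):

```python
# Lion-style update, per "Symbolic Discovery of Optimization Algorithms".

def sign(x):
    return (x > 0) - (x < 0)

def lion_step(w, g, m, lr=0.1, beta1=0.9, beta2=0.99, wd=0.0):
    update = sign(beta1 * m + (1 - beta1) * g)  # interpolate, then take the sign
    w = w - lr * (update + wd * w)              # decoupled weight decay
    m = beta2 * m + (1 - beta2) * g             # momentum tracked with beta2
    return w, m

w, m = lion_step(w=1.0, g=2.0, m=0.0)
print(w, m)  # → 0.9 0.02
```

Because the update is a sign, every parameter moves by exactly ±lr (plus decay), which is why Lion typically wants a smaller learning rate than Adam.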
AutoGPT is a prototype of the next frontier: "Agent Smith" AI that recursively clones itself.
Achieved by (1) identifying *when* its context gets overwhelming and needs offloading;
(2) distilling the “cognitive overflow” part into a prompt directive for its clone;
(3) talking…
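The three steps can be caricatured as a toy recursion. This is a hypothetical structure for illustration, not AutoGPT's actual code:

```python
# Clone-on-overflow sketch: detect a full context, distill it into a
# directive, and hand the remaining work to a fresh clone.

def run_agent(tasks, directive="start", context_limit=3, depth=0):
    context, done = [directive], []
    for i, task in enumerate(tasks):
        if len(context) >= context_limit:                       # (1) detect overflow
            summary = f"resume: {len(done)} tasks already done"  # (2) distill context
            return done + run_agent(tasks[i:], summary, context_limit, depth + 1)  # (3) clone
        context.append(task)
        done.append(f"[clone {depth}] {task}")
    return done

print(run_agent(["a", "b", "c", "d", "e"]))
# → ['[clone 0] a', '[clone 0] b', '[clone 1] c', '[clone 1] d', '[clone 2] e']
```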
This may be Apple's biggest move on open-source AI so far: MLX, a PyTorch-style NN framework optimized for Apple Silicon, e.g. laptops with M-series chips.
The release did an excellent job on designing an API familiar to the deep learning audience, and showing minimalistic…
My guess is that MidJourney has been doing a massive-scale reinforcement learning from human feedback ("RLHF") - possibly the largest ever for text-to-image.
When human users choose to upscale an image, it's because they prefer it over the alternatives. It'd be a huge waste not…
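Mechanically, each upscale click yields three "chosen beat rejected" pairs from a 4-image grid, which is exactly the data a reward model wants. A sketch using the standard Bradley-Terry objective (the data layout is my assumption, not MidJourney's disclosed pipeline):

```python
# Turn upscale clicks into pairwise preference data for a reward model.

import math

def clicks_to_pairs(grid, upscaled_index):
    """A 4-image grid plus one upscale click → 3 (preferred, rejected) pairs."""
    chosen = grid[upscaled_index]
    return [(chosen, other) for i, other in enumerate(grid) if i != upscaled_index]

def bradley_terry_loss(score_chosen, score_rejected):
    """-log sigmoid(s_chosen - s_rejected): low when the model agrees with the click."""
    return -math.log(1 / (1 + math.exp(-(score_chosen - score_rejected))))

pairs = clicks_to_pairs(["img_a", "img_b", "img_c", "img_d"], upscaled_index=2)
print(pairs)  # → [('img_c', 'img_a'), ('img_c', 'img_b'), ('img_c', 'img_d')]
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.0, 2.0))  # → True
```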
OpenAI just announced ChatGPT Plugins. If ChatGPT's debut was the "iPhone event", today is the "iOS App Store" event.
3 official plugins available now:
- Web browser: adding Bing in the loop
- Code interpreter: adding a live Python interpreter in a …
You think MidJourney's /describe is just a cool new tool? Think again. I believe hidden behind /describe is MidJourney's next-generation data flywheel.
/describe guesses the prompt from an image you upload. Then you can select from (or edit) 4 choices to generate more images.…
In my decade spent on AI, I've never seen an algorithm that so many people fantasize about. Just from a name, no paper, no stats, no product. So let's reverse engineer the Q* fantasy. VERY LONG READ:
To understand the powerful marriage between Search and Learning, we need to go…
Blackwell, the new beast in town.
> DGX Grace-Blackwell GB200: exceeding 1 Exaflop compute in a single rack.
> Put numbers in perspective: the first DGX that Jensen delivered to OpenAI was 0.17 Petaflops.
> GPT-4-1.8T parameters can finish training in 90 days on 2000 Blackwells.…
Let's reverse engineer the phenomenal Tesla Optimus. No insider info, just my own analysis. Long read:
1. The smooth hand movements are almost certainly trained by imitation learning ("behavior cloning") from human operators. The alternative is reinforcement learning in…
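Behavior cloning itself is nothing exotic: supervised learning on (observation, operator-action) pairs from teleoperation. A minimal 1-D linear-policy sketch with made-up numbers:

```python
# Behavior cloning = regress operator actions from observations.

def fit_linear_policy(obs, acts, lr=0.05, steps=500):
    """Fit action ≈ w * obs + b by gradient descent on squared error."""
    w, b = 0.0, 0.0
    n = len(obs)
    for _ in range(steps):
        grad_w = sum((w * o + b - a) * o for o, a in zip(obs, acts)) * 2 / n
        grad_b = sum((w * o + b - a) for o, a in zip(obs, acts)) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Teleoperation demos where the operator's action is roughly 2x the observation:
demos_obs = [0.0, 1.0, 2.0, 3.0]
demos_act = [0.1, 2.0, 4.1, 6.0]
w, b = fit_linear_policy(demos_obs, demos_act)
print(round(w, 1))  # → 2.0
```

A real humanoid policy swaps the linear map for a large network over camera images, but the loss is the same idea.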
Many people don’t understand how challenging Minecraft is for AI agents.
Let me put it this way. AlphaGo solves a board game with only 1 task, countably many states, and full observability.
Minecraft has infinite tasks, infinite gameplay, and tons of hidden world knowledge. 🧵
This is a neural network flying a drone at extremely high speed, beating human champions in FPV drone racing.
- Reinforcement learning as a tool is so marvelously versatile. It's able to solve both fast, reactive tasks and slow, deliberate tasks (ChatGPT RLHF).
- Trained in…
Kaiming He, inventor of ResNet, is leaving industry to join MIT faculty in 2024!! He’s one of the most impactful figures in deep learning.
- Residual layer is a fundamental building block of LLMs.
- Faster/Mask R-CNN are industry standards for image segmentation and robot…
I was OpenAI's first intern in 2016. I used to chat about the next learning paradigm with @ilyasut, engineering with @gdb, and scaling & safety with Dario. That summer reshaped my perspective and taste on AI research forever. I have huge admiration and respect for all of them.…
Apparently some folks don't get "data-driven physics engine", so let me clarify. Sora is an end-to-end, diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent…
I can finally discuss something extremely exciting publicly. Jensen just announced NVIDIA AI Foundations:
- Foundation Model as a Service is coming to enterprise, customized for your proprietary data.
- Multimodal from day 1: text LLM is just one part. Bring your images, videos,…
GPT-4 is HERE. Most important bits you need to know:
- Multimodal: API accepts images as inputs to generate captions & analyses.
- GPT-4 scores 90th percentile on the Bar exam!!! And 99th percentile with vision on the Biology Olympiad! Its reasoning capabilities are far more advanced…
I think DALL·E 3 is not just a stance against MidJourney. It's actually a sneak peek of the upcoming, epic battle of massively multimodal LLMs, against DeepMind Gemini.
Quote: "DALL·E 3 is built natively on ChatGPT". This is the key phrase.
DALL·E 3's extraordinary language…
This is likely the most significant lawsuit in AI history - its outcome would have far-reaching impact on the whole industry.
The arguments get fairly philosophical. Quote:
"The purpose of copyright law, OpenAI argued, is 'to promote the Progress of Science and useful Arts' by…
It took my brain a while to parse what's going on in this video. We are so obsessed with "human-level" robotics that we forget it is just an artificial ceiling. Why don't we make a new species superhuman from day one? Boston Dynamics has once again reinvented itself. Gradually,…
GPT-4's vision API isn't public yet, but something better is here.
Genmo: a creative & multimodal chatbot that not only takes image as input, but also generates and EDITs images and videos. Unlike Midjourney, Genmo is an *interactive* assistant able to …
I'm waking up to the prospect that in my prime years, I'll see both mainstream superconducting and AGI. The former will propel the latter, and the latter will propel every scientific breakthrough.
These should've stayed in sci-fi for another 20 yrs. But somehow, they are eerily…
I see some vocal objections: "Sora is not learning physics, it's just manipulating pixels in 2D".
I respectfully disagree with this reductionist view. It's similar to saying "GPT-4 doesn't learn coding, it's just sampling strings". Well, what transformers do is just manipulating…
Everyone should read the celebrated mathematician Terence Tao's blog on LLM. He predicts that AI will be a trustworthy co-author in mathematical research by 2026, when combined with search and symbolic math tools.
I believe math will be the first scientific discipline to see…
There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them.
I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression.
Sharing my notes:
-…
Google is hosting the first "Machine Unlearning" challenge. Yes you heard it right - it's the art of forgetting, an emergent research field.
GPT-4 lobotomy is a type of machine unlearning. OpenAI tried for months to remove abilities it deems unethical or harmful, sometimes…
Here's my prediction of what's next. The infinite energy of @sama & @gdb cannot be contained. They will re-build Rome from the ashes with an even greater sense of urgency. OpenAI just created its mightiest competitor, and we are all seeing it unfold in real-time.
And it happened…
The launch of GPT-4 will be a predictably seismic event this year.
But I can predict with high confidence what GPT-4 *cannot do*:
It can’t cook spaghetti, play tennis, or build a lego treehouse.
Robotics will be the last moat we conquer in the grand quest for AI 🤖🦾
It’s pretty obvious that synthetic data will provide the next trillion high-quality training tokens. I bet most serious LLM groups know this. The key question is how to SUSTAIN the quality and avoid plateauing too soon.
The Bitter Lesson by @RichardSSutton continues to guide AI…
One of the best tutorial-style repos since @karpathy's minGPT! GPT-Fast: a minimalistic, PyTorch-only decoding implementation loaded with best practices: int8/int4 quantization, speculative decoding, Tensor parallelism, etc. Boosts the "clock speed" of LLM OS by 10x with no model…
GPT3 is powerful but blind. The future of Foundation Models will be embodied agents that proactively take actions, endlessly explore the world, and continuously self-improve. What does it take? In our NeurIPS Outstanding Paper “MineDojo”, we provide a blueprint for this future:🧵
Why does generative AI struggle with hands?
It is not a mystical Bermuda Triangle in the latent space. There're compelling reasons:
1. Data size (duh). Face pics are much more common than hand pics. Even when the whole body is shown, hands tend to occupy much smaller pixel real…
How to dodge a question like a Jedi master:
"You're a very experienced reporter. You know I can't comment on that. I know you know I can't comment on that. You know I know you know I can't comment on that. In the spirit of shortness of life, why do you ask?"
Way to go @sama 🤣
Why does ChatGPT work so well? Is it “just scaling up GPT-3” under the hood? In this 🧵, let’s discuss the “Instruct” paradigm, its deep technical insights, and a big implication: “prompt engineering” as we know it may likely disappear soon:👇
Transformers are here to stay for a while. Not because it’s the absolute best architecture, but because the staggering amount of resources lock us to the existing weights.
Starting another model evolution tree will literally burn forests to the ground (CO2). You only train once.
In…
If you don’t feel like paying $20/mo for ChatGPT Pro, try out Poe (by Quora, @adamdangelo). It is currently the only frontend that supports Claude @AnthropicAI, and the ChatGPT interface runs silk smooth. Free for now (at least?)
I like that Poe automatically highlights key…
DALL-E generates pixels from text. Now meet its cousin, VALL-E, that generates audio from text @MSFTResearch!
VALL-E’s resemblance to DALL-E v1 and Parti @GoogleAI is striking. Image and audio are both continuous signals, but they can be quantized into discrete tokens.
1/🧵
We train Transformers to encode algorithms in their weights, such as sorting, counting, and balancing parentheses from lots of data.
I never thought we may also go in the *reverse* direction: *compile* Transformer weights directly from explicit code! Cool paper @DeepMind:
1/🧵
OpenAI is now helping Coca-Cola improve its marketing & operations.
I find this move highly consequential. It signals OpenAI’s strategic shift away from a horizontal provider (ChatGPT, DALLE, Codex) towards capturing massive values from verticals.
Thin-wrapper startups should…
What GPT-4 gains in IQ, it sacrifices in empathy. Below is someone with suicidal thoughts seeking help. GPT-4 answers like an automated call center, unlike ChatGPT.
In the sci-fi series Westworld, Dr. Ford (creator of AGI) says that "suffering" is the final step for AI to awaken…
Do you know that DeepMind has actually open-sourced the heart of AlphaGo & AlphaZero?
It’s hidden in an unassuming repo called “mctx”:
It provides JAX-native Monte Carlo Tree Search (MCTS) that runs on batches of inputs, in parallel, and blazing fast.
🧵
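The loop mctx accelerates is conceptually small. Here is a pure-Python sketch of select → rollout → backup, collapsed to depth one (so it reduces to UCB1 over stochastic rollouts; this is not the mctx API):

```python
# Core MCTS loop at depth one: UCB1 selection, stochastic rollout, backup.

import math, random

def mcts_choose(rewards, simulations=200, c=1.4, seed=0):
    """Pick the action with the most visits after `simulations` rounds.
    `rewards[a](rng)` samples a rollout return for action a."""
    rng = random.Random(seed)
    visits = [0] * len(rewards)
    values = [0.0] * len(rewards)
    for t in range(1, simulations + 1):
        # Selection: maximize the UCB score (unvisited actions first).
        def ucb(a):
            if visits[a] == 0:
                return float("inf")
            return values[a] / visits[a] + c * math.sqrt(math.log(t) / visits[a])
        a = max(range(len(rewards)), key=ucb)
        # Rollout + backup.
        values[a] += rewards[a](rng)
        visits[a] += 1
    return max(range(len(rewards)), key=lambda a: visits[a])

arms = [lambda rng: rng.random() * 0.4,        # mean return ~0.2
        lambda rng: 0.5 + rng.random() * 0.4]  # mean return ~0.7
print(mcts_choose(arms))  # → 1
```

The full algorithm recurses this loop down a game tree; mctx's contribution is running it batched and JIT-compiled on accelerators.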
The upcoming Llama-3-400B+ will mark the watershed moment when the community gains open-weight access to a GPT-4-class model. It will change the calculus for many research efforts and grassroots startups. I pulled the numbers on Claude 3 Opus, GPT-4-2024-04-09, and Gemini.…
Autonomous driving with Chain of Thought - autopilot thinking out loud in text!
LINGO-1 is the most interesting work I've read in autodriving for a while.
Before: perception -> driving action
After: perception -> textual reasoning -> action
LINGO-1 trains a video-language…
If Google didn't publish the Transformer paper, the history of AI (and possibly humanity) would be set back many years. Everyone would've been worse off.
Open research is a powerful strategy. It pains me to see an emerging trend of not only closing models, but also refusing to…
Meta started open-sourcing a lot and is now becoming one of the best companies in the world at shipping AI features. Coincidence? I don’t think so.
Contrary to popular belief, a company (or a country) sharing their research, models and datasets publicly in open-source makes them…
Hmmm, @OpenAI just acquired a company called "Global Illumination" that makes an open-source Minecraft clone.
What's next, multi-agent civilization sim running on GPT-5? Maybe Minecraft is indeed all you need for AGI? I'm intrigued.🤔
Announcement:
Company:…
Wow, @MetaAI has been on open-source steroids since Llama.
ImageBind: Meta's latest multimodal embedding, covering not only the usual suspects (text, image, audio), but also depth, thermal (infrared), and IMU signals!
OpenAI Embedding is the foundation for AI-powered search and…
led by @elonmusk is the latest heavyweight player in AI. I see a few unique strengths in Elon's ecosystem:
▸ Lots of multimodal data on Twitter: dialogue text, images, and a growing collection of long videos. is the only AI…
You can now operate robots by just thinking about it. With your brain signals. WOW.
This robot system from Stanford has so much sci-fi vibe and wild implications that I don't even know where to start.
NOIR decodes the EEG signal from your head into a library of robot skills.…
A fact worth highlighting: NVIDIA is making its own *CPU*, and will increasingly excel at it. To max out GPU performance, building the CPU in-house is an inevitable path.
Below is GH200, the first superchip that includes all home-grown components: CPU (Grace), GPU (Hopper), and…
I confirmed with friends at the team that they did not speed up the video. Having such smooth motions at real-time, especially in hand dexterity, will unlock LOTS of new capabilities down the road. Regardless of how well you train the model in the world of bits, a slow and…
My TED talk is finally live!! I proposed the recipe for the "Foundation Agent": a single model that learns how to act in different worlds. LLM scales across lots and lots of texts. Foundation Agent scales across lots and lots of realities. If it is able to master 10,000 diverse…
What did I tell you a few days ago? 2024 is the year of robotics. Mobile-ALOHA is an open-source robot hardware that can do dexterous, bimanual tasks like cooking a meal (with human teleoperation). Very soon, hardware will no longer bottleneck us on the quest for human-level,…
If there's a higher being who writes the simulation code for our reality, we can estimate the file size of the compiled binary. Meta AI's Emu Video is 6B parameters. Let's say if Sora is 10x larger with bfloat16, then the Creator's binary might be no larger than 111 Gb.
Caveats:…
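For the curious, the arithmetic behind that number, spelled out (the 10x scale factor and bfloat16 storage are the estimate's own assumptions):

```python
# Back-of-envelope size of the "Creator's binary".
emu_params = 6e9                   # Emu Video: 6B parameters
sora_params = emu_params * 10      # assume Sora is ~10x larger
bytes_total = sora_params * 2      # bfloat16 = 2 bytes per parameter
print(int(bytes_total / 2**30))    # → 111  (GiB; the "111 Gb" in the tweet)
```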
I'm going to OpenAI Dev Day! If the leaks are true, it'll be a pivotal moment for the AI consumer market:
OpenAI is becoming a full-blown UGC platform, where users can create and share any AI agents. It's a superset of RPA, Character AI, Plugin store, and much more. The…
Tesla FSD v13 will likely be grokking language tokens. What excites me the most about Grok-1.5V is the potential to solve edge cases in self-driving. Using language for "chain of thought" will help the car break down a complex scenario, reason with rules and counterfactuals, and…