Career Update: Today, I bid farewell to Waymo, marking the end of a chapter in my career.
Joining Waymo in late 2019, I entered a world where Level 4 (L4) robotaxi services were a concept rather than a reality. Now, in 2024, Waymo operates a rider-only robotaxi service in four
First week at Tesla. The vibe felt like a startup: few meetings, lots of hardcore building, a flat team structure, direct communication across levels, and super fast iterations. I've really liked it so far.
Referral link below to buy a Tesla car with benefits.
I was fortunate to get to test drive the car: the acceleration is insane. It also uses premium materials for the sport seats and has great sound insulation.
New Model 3 Performance launching today 🏎️
→
0-60 mph in 2.9 s
510 hp / 741 Nm
163 mph top speed
—
Performance-tuned chassis
Same quiet & comfortable cabin, plus bespoke chassis hardware for improved stiffness and a higher performance baseline.
More power,
Another perk of a Tesla EV is its super low maintenance. The only time I took my car in for service over the past 5 years was for a simple tire rotation.
Got a taste of
@Tesla
's FSD v12.3.4 last night. By no means flawless, but the human-like driving maneuvers (with no interventions) delivered a magical experience. Excited to witness the recipe of scaling laws and the data flywheel for full autonomy show signs of life in real products.
What useful tasks can a humanoid robot do for a factory?
This newly released video from Tesla shows Optimus (Tesla Bot) sorting battery cells, a seemingly simple but nuanced task 🤖 🦾🔋
The video also gives a glimpse of how the bot was trained — through human demonstration
Ideas are cheap — that’s something I learned from my PhD experience too.
Often a researcher sees a paper and says, "I thought of the same idea, I could have done this paper!" However, turning an initial idea into a final product, whether that's a paper or a business, is the
Beijing is hosting an international auto exhibition this weekend, drawing significant attention to EVs. Although Tesla did not participate in the exhibition, Elon's timely presence in the city certainly amplified the buzz.
Wonderful fireside chat between Jim Fan
@DrJimFan
and Percy Liang
@percyliang
on the Future of Foundation Models at
#GTC2024
A few memorable points to myself:
1. Percy is the person who coined the term “foundation models”, or FoMo :) The switch turned on for him in 2020 when
In-depth meditation (3+ days of continuous practice) is worth a try for everyone.
It takes 2-3 days to get over the drowsiness and calm the thinking mind. Then, with good practice, one can enter a mode of flow and experience true peacefulness. With enough practice, such a
Returning from an experimental ~2 week detox from the internet. Main takeaway: I didn't realize how unsettled the mind can get when over-stimulated by problems/information (like a stirred liquid), and ~2 weeks is enough to settle into a much more zen state.
I'm struck by
The evolution of deep learning frameworks never stops.
My own journey started with Matlab, Theano, Caffe, and Torch, and then moved on to TensorFlow, PyTorch, and JAX:
2013-2014: Matlab. Studying Andrew Ng’s early day DL tutorials and rotating in Andrew’s lab working on NN
# CUDA/C++ origins of Deep Learning
Fun fact: many people might have heard about the ImageNet/AlexNet moment of 2012 and the deep learning revolution it started.
What's maybe a bit less known is that the code backing this winning submission to the
+1 The bitter lesson in AI is that general methods that leverage computation are ultimately the most effective, and by a large margin. This is due to the quickly falling cost of compute (e.g. Moore's law, GPUs).
Exponential data growth is another contributor that is not
A typical day in Mountain View, California. Riding in FSD following a Waymo car, smoothly passing through a construction zone. Autonomy is closer than you think.
#fsd
#waymo
I’m attending Nvidia GTC next week in person, for the first time ever!
I’m going with three questions in mind:
1. Is there a “Moore’s law” for AI computing? How soon can we reduce the cost of compute by 1000x? What is the bottleneck, is it memory, bandwidth or energy?
2. How
Riding a Waymo robotaxi (rider only, no safety driver!) in downtown San Francisco from the Financial District to Little Italy. Smooth drive and no disengagements. Will post another one with FSD in the next tweet.
#waymo
#robotaxi
#SanFrancisco
This beast of a rack, 2 meters tall, can host 72 GB200 GPUs with 72 kW of liquid cooling. Estimated to cost $3M (my rough guess) and arriving in Q4 2024.
#GTC2024
#nvidia
#B200
@Trescend
What YOLO training is: when you have limited resources, you can't iterate over all design choices. So you collect all the good ideas, trust your intuition, and You Only Live/Train Once with the best configuration.
I think AI agentic workflows will drive massive AI progress this year — perhaps even more than the next generation of foundation models. This is an important trend, and I urge everyone who works in AI to pay attention to it.
Today, we mostly use LLMs in zero-shot mode, prompting
All those “AGI will arrive by 20XX” predictions remind me of the hype time of autonomous driving in 2012-2017, when Google/GM/Ford/Tesla etc. all made various predictions of widespread L4 autonomy by early 2020s.
In reality it's much easier to show demos and small-scale successes
With the advent of LLMs, tech companies (products & orgs) are changing rapidly, or even being completely redefined. This results in a "high frequency trading" mode among top AI talent.
As an individual, you need to know your position in this paradigm shift. Are you the one being
In the era of neural network 3.0: novelty is no longer inventing exotic / special architectures.
NN1.0: the perceptron era. shallow learning.
NN2.0: the deep learning era. all kinds of archs were invented (AlexNet, ResNet, LSTM, PointNet, …, Transformers)
NN3.0: now
When I review an academic paper, I often ask myself: Is the method scalable to leverage more compute and data?
For modeling work, I consider: significance = metric gains / complexity.
Often, an over-engineered architecture or complex pipeline does not survive this
Waymo has a Research Scientist opening with a focus on computer vision and generative modeling for the application of road understanding/mapping. Preferred background: MS/PhD with top-tier CV/ML publications and/or ML experience in industry.
Recent AI generative models like stable diffusion and Sora showcase neural compression as a pivotal tech.
Initially I viewed these neural image/video compression models as novel yet impractical due to their much higher computation cost. I didn't realize their potential links to
#GTC2024
is on! First impression: it's like a workshop/tutorial day at an academic conference (CVPR, ICCV), with various tech talks and Jensen's keynote coming in the afternoon.
Search is a nice way to trade time for accuracy. An NN is a "fast" thinker; search is a "slow" thinker. Fast and slow together get you the best results overall. There is still much to explore before we get AGI.
8 years ago today, AlphaGo beat Lee Sedol in a milestone for AI. Unlike typical neural nets, AlphaGo spent ~1 minute per move improving its policy via search. This boosted its Elo more than a 1000x bigger model would. Even today, nobody has trained a raw NN that is superhuman in Go.
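A toy illustration of "search trades time for accuracy" (this is a made-up best-of-N sampling example, not AlphaGo's actual algorithm): a fast policy guesses blindly, while a slow policy samples many guesses and keeps the best-scoring one.

```python
import random

random.seed(0)

def fast_policy():
    """Fast thinker: one random guess at the best integer in [0, 99]."""
    return random.randint(0, 99)

def score(x):
    """Ground-truth value function: peaks at x = 70."""
    return -abs(x - 70)

def slow_policy(n_samples):
    """Slow thinker: sample n guesses, keep the best-scoring one."""
    return max((fast_policy() for _ in range(n_samples)), key=score)

for n in (1, 10, 100):
    avg = sum(score(slow_policy(n)) for _ in range(1000)) / 1000
    print(f"n={n:>3}: avg score {avg:.1f}")
# Larger n (more search time) -> average score climbs toward 0, the optimum.
```

Spending more compute at decision time improves the result without touching the underlying "model", which is the basic trade AlphaGo's per-move search exploits.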
What’s your role in AI? A message to the young audience from Fei-Fei Li as a closing remark of the fireside chat between Fei-Fei and Bill Dally.
#GTC2024
Other key points:
1. Compared to worrying about AI singularity, there are more urgent near term topics such as its impact
Re Q1: For the last 8 years (2016-2024), AI compute has grown 1000x. If this trend continues, we'll see another 1000x of compute by 2032; that likely means training GPT-4 within 2 hours (vs. the current 90 days with 8,000 H100s).
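A quick back-of-the-envelope check of that projection (the 1000x-per-8-years rate and the 90-day GPT-4 figure are the post's rough numbers, not official ones):

```python
# If AI compute grows 1000x every 8 years, a run that takes 90 days
# today would take 90 * 24 / 1000 hours on 2032-era compute.
growth_per_8_years = 1000
days_now = 90

hours_then = days_now * 24 / growth_per_8_years
print(f"{hours_then:.2f} hours")  # 2.16 hours, i.e. roughly the "2 hours" claim

# Implied annual growth rate: 1000^(1/8), about 2.4x per year.
annual = growth_per_8_years ** (1 / 8)
print(f"{annual:.2f}x per year")
```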
How to build the data flywheel for robots?
- I believe it is ultimately by directly observing humans (no extra HW required) + low-level control learning with SSL/RL.
- The humanoid form factor is needed to align with humans.
- UMI is a great step towards it (lower cost).
Can we collect robot data without any robots?
Introducing Universal Manipulation Interface (UMI)
An open-source $400 system from
@Stanford
designed to democratize robot data collection
0 teleop -> autonomously wash dishes (precise), toss (dynamic), and fold clothes (bimanual)
Key points from the fireside chat between David Luan (Adept CEO) and Bryan (Nvidia VP).
#GTC2024
1. We have used up text tokens (all human content ever created, in all languages) but the models are not done. Moving forward we will see increased specialization (e.g. finetuning and
JUST IN: State regulators say Waymo can expand its driverless ride-hailing operations to the Peninsula and Los Angeles, marking a massive commercial expansion that launched last year in San Francisco.
Search 2.0 is the personal assistant. Every one of us will be the CEO of our own company, with AI executive assistants serving us.
Such AI-driven search is generative. Results are generated instead of retrieved. They are customized to your question and your context (why you ask,
What about generating highly compressed VQ (vector quantized) tokens that are discrete? That makes the video generation problem much more manageable. Some examples are VQVAE tokens used in DALL-E and MAGVITv2 tokens used in VideoPoet.
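A minimal sketch of what vector quantization does at inference time: map each continuous vector to the index of its nearest codebook entry, turning a frame into a short sequence of discrete token ids. The codebook values below are made up for illustration; real VQ-VAE/MAGVIT codebooks are learned and far larger.

```python
def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))

# Toy 2-D codebook with 4 entries; a real one might hold thousands of
# learned high-dimensional vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# A "frame" of continuous features becomes discrete tokens, which an
# autoregressive model can then predict like text.
frame = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.95)]
tokens = [quantize(v, codebook) for v in frame]
print(tokens)  # [0, 1, 3]
```

The payoff is exactly the one the post points at: once video is a stream of discrete ids, next-token prediction machinery applies directly.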
Modeling the world for action by generating pixel is as wasteful and doomed to failure as the largely-abandoned idea of "analysis by synthesis".
Decades ago, there was a big debate in ML about the relative advantages of generative methods vs discriminative methods for
1. Money is a tool invented by humans for value exchange across space and time.
2. Bitcoin is a “better” form of money than gold and fiat currencies.
3. Adoption rate of the better form of money will go up over time.
(This is not financial advice)
Happy International Women’s Day! 🎉
Last week we hosted a fireside chat with four incredible female leaders in tech. Check out the post for takeaways!
Follow the program on LinkedIn or follow me on X for more updates of future events.
@StefanoErmon
Competition is on. In video gen, AR models are catching up with diffusion. In text gen, we now have diffusion competitive with AR. Wondering what your opinion is on the end game?
Re Q2: Nvidia continues to lead by co-optimizing HW, systems, and algorithms. This lets it provide AI computing solutions 1-2 orders of magnitude better than alternatives (including its own previous-gen solutions!) in terms of throughput and energy efficiency, which gives Nvidia an edge in
@JieWang_ZJUI
Since the industry is moving so fast, the opportunity cost of doing a PhD is higher than before. A Master's with research experience is a good balance. The PhD will be for those who value freedom in research or are interested in faculty jobs.
When network capacity becomes really big and compute gets really cheap, this seems very possible. We probably also need self-learning for network pruning to recycle unused connections (like in a human baby brain's "pretraining").
It will take a while (5+ years?) to get there but
Beyond Language Models
Byte Models are Digital World Simulators
Traditional deep learning often overlooks bytes, the basic units of the digital world, where all forms of information and operations are encoded and manipulated in binary format. Inspired by the success of next
Interestingly, these generative models can themselves be viewed as compressors of data. E.g. GPT compresses internet text, as there is a strong tie between predicting the next token accurately (lower entropy) and efficient compression.
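The tie is just arithmetic: an ideal entropy coder (e.g. arithmetic coding) spends -log2(p) bits on a token the model assigned probability p. A hypothetical example with made-up probabilities:

```python
import math

def coding_cost_bits(probs):
    """Total bits to encode a sequence under ideal entropy coding,
    given the model's probability for each actual next token."""
    return sum(-math.log2(p) for p in probs)

# A confident next-token model vs. a uniform model over a 256-symbol
# alphabet, for the same 4-token sequence (probabilities are made up).
confident = [0.9, 0.8, 0.95, 0.7]
uniform = [1 / 256] * 4

print(f"confident model: {coding_cost_bits(confident):.2f} bits")
print(f"uniform model:   {coding_cost_bits(uniform):.2f} bits")  # 32.00 bits
# Higher probability on the true next token == fewer bits == better compression.
```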
My typical day as a Member of Technical Staff at OpenAI:
[9:00am] Wake up
[9:30am] Commute to Mission SF via Waymo. Grab avocado toast from Tartine
[9:45 am] Recite OpenAI charter. Pray to optimization Gods. Learn the Bitter Lesson
[10:00am] Meetings (Google Meet). Discuss how to
@Ishan345
@Waymo
Good to hear! Progress is definitely significant. Many thought L4 in San Francisco was impossible in this century, but look, we are here in 2024. There are both dramatic over- and under-predictions
From their tech report/blog: Sora is based on Latent Diffusion with DiT (diffusion transformers). This is in contrast to the recent Google work “VideoPoet” with autoregressive models and discrete tokens. Is it the model or the data that matters most?
Introducing Sora, our text-to-video model.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions.
Prompt: “Beautiful, snowy