I quit my PhD (for a day) and opened a boba shop at
@MIT
- Generative Boba! It’s a huge success - right next to our office, so all the AI researchers are enjoying it. Check out our boba diffusion algorithm in the poster to understand why boba generation is so important to
@MIT_CSAIL
!
Introducing Spatial VLM, a Vision-Language Model with 3D Spatial Reasoning Capabilities by
@GoogleDeepmind
. We investigate to what extent synthetic data can help VLMs learn
- 3D relationships
- quantitative distance
- CoT spatial reasoning
- RL reward
(1/6)
How can we ground large language models (LLM) with the surrounding scene for real-world robotic planning?
Our work NLMap-Saycan allows LLMs to see and query objects in the scene, enabling real robot operations unachievable by previous methods.
Link:
1/6
My AI+robotics lecture at MIT is now public! Back in 2023 I talked about how video models, along with other foundation models like LLMs, will revolutionize robotics. Now SORA has arrived, so definitely check out my lecture on Foundation Models for Decision Making at
I am giving a guest lecture at MIT about "Foundation Models for Decision Making". My current plan is LLMs (SayCan, RT-2, Code as Policies, Toolformer, etc.) + video prediction (UniPi, Video Language Planning). Please let me know if you have additional suggestions!
I am presenting my paper “RaMP: Self-Supervised Reinforcement Learning that Transfers using Random Features” at poster 1427 from 5-7pm at NeurIPS 2023! Don’t miss it!
Website:
Proudly announcing our ICLR publication by my first undergraduate mentee, Suning! DittoGym is a set of RL environments and algorithms for shape-shifting robots like the Pokémon Ditto.
Suning is interviewing for PhD admission right now. Don’t miss an amazing student like him!
Excited to introduce DittoGym @ ICLR, in which we study the control of a neat new kind of robot: soft shape-shifters! This is work done by
@SuningHuan44558
during his visit to my group at MIT, jointly with my student
@BoyuanChen0
!
Project page:
1/n
Tying shoelaces is traditionally a hard problem for robotics. This seems like the first end-to-end policy to do it. This time Tony got us more dexterity in addition to generalization. Cannot wait to see the tech report!
Introducing 𝐀𝐋𝐎𝐇𝐀 𝐔𝐧𝐥𝐞𝐚𝐬𝐡𝐞𝐝 🌋 - Pushing the boundaries of dexterity with low-cost robots and AI.
@GoogleDeepMind
Finally got to share some videos after a few months. The robots are fully autonomous, filmed in one continuous shot. Enjoy!
Wow, VLMs are opening up a new world for motion planning! This work from my friend Wenlong shows quite impressive results by leveraging foundation models.
How to harness foundation models for *generalization in the wild* in robot manipulation?
Introducing VoxPoser: use LLM+VLM to label affordances and constraints directly in 3D perceptual space for zero-shot robot manipulation in the real world!
🌐
🧵👇
While imitation learning has a leading edge in robotic manipulation, let’s not forget the works that use video models - potentially achieving zero-shot generalization. Dreamer and UniPi are my favorite works along this line.
Llama 3 is trained with over 10 million human-annotated examples - that's also my guess of how many diverse trajectories you need to collect to get a generalist GPT for robotics after large-scale pretraining.
I have the same feeling. Always remember that GPT is not just a big model, but a huge amount of diverse data. For robotics, people must leverage data in other forms, as directly teleoperating robots hardly scales up. Videos and VR recordings of human activities will be my biggest bet.
We are unlikely to create an “ImageNet for Robotics”. In retrospect, ImageNet is such a homogeneous dataset. Labeled images w/ boxes.
Generalist robot models will be fueled by the Data Pyramid, blending diverse data sources from web and synthetic data to real-world experiences.
Check out our DittoGym poster at Halle B
#33
@ ICLR on Thursday from 10:45 AM to 12:45 PM.
#ICLR2024
If you need your robot🤖 to flexibly retrieve a key🔑 when it has fallen into a narrow slot, be sure to explore our paper!
Let’s think about humanoid robots beyond carrying boxes. How about having the humanoid come out the door, interact with humans, and even dance?
Introducing Expressive Whole-Body Control for Humanoid Robots:
See how our robot performs rich, diverse,
It is the annual visit day time for grad school admits. If you have a hard time making your choice, maybe my blog will help you! Check out my post “best computer science schools ranked by boba” at
Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof.
@yukez
. GEAR stands for Generalist Embodied Agent Research.
We believe in a future where every machine that moves will be autonomous, and robots and
@Ken_Goldberg
@Stanford
Unpopular opinion: I think the answer is obviously true. The more important question is “Can we collect LLM-scale data for robotics?” because that’s what makes robotics different.
This is so exciting! I have been watching NVIDIA Isaac Gym and Omniverse since the day they were announced at GTC. Now they are already used in cutting-edge research. Can’t wait to see GR00T change the way we work as well. BTW I really hope it will be open-sourced in the future.
Today is the beginning of our moonshot to solve embodied AGI in the physical world. I’m so excited to announce Project GR00T, our new initiative to create a general-purpose foundation model for humanoid robot learning.
The GR00T model will enable a robot to understand multimodal
Introducing 𝐌𝐨𝐛𝐢𝐥𝐞 𝐀𝐋𝐎𝐇𝐀🏄 -- Learning!
With 50 demos, our robot can autonomously complete complex mobile manipulation tasks:
- cook and serve shrimp🦐
- call and take elevator🛗
- store a 3 lbs pot in a two-door cabinet
Open-sourced!
Co-led
@tonyzzhao
,
@chelseabfinn
Can we collect robot data without any robots?
Introducing Universal Manipulation Interface (UMI)
An open-source $400 system from
@Stanford
designed to democratize robot data collection
0 teleop -> autonomously wash dishes (precise), toss (dynamic), and fold clothes (bimanual)
Check out this cool work from my group! Researchers can generate 3D scenes from a single image by diffusing 2D novel views. However, it turns out you need a 3D-structured forward model if you want high-quality 3D scenes.
This clearly scales up data collection for dexterous tasks. I am curious how the number of demos needed changes when moving from an egocentric camera to a point cloud. A lot of science can be done on top of this.
Can we use wearable devices to collect robot data without actual robots?
Yes! With a pair of gloves🧤!
Introducing DexCap, a portable hand motion capture system that collects 3D data (point cloud + finger motion) for training robots with dexterous hands
Everything open-sourced
To that end, we propose NLMap to address two core problems: 1) how to maintain open-vocabulary scene representations that are capable of locating arbitrary objects, and 2) how to merge such representations into long-horizon LLM planners to imbue them with scene understanding.
3/
This project fully shows the power and potential of AI agents. While I care more about real world robots, seeing such progress in AI agents is quite exciting. Definitely the next frontier of AI research.
Scalable, reproducible, and reliable robotic evaluation remains an open challenge, especially in the age of generalist robot foundation models. Can *simulation* effectively predict *real-world* robot policy performance & behavior?
Presenting SIMPLER!👇
Congrats to Cameron and David! FlowMap helps you get rid of the COLMAP bottleneck. BTW David writes beautiful code, so you will definitely like his projects!
Introducing “FlowMap”, the first self-supervised, differentiable structure-from-motion method that is competitive with conventional SfM like COLMAP!
IMO this fills in a major missing piece for internet-scale training of 3D deep learning methods.
1/n
Somebody is finally comparing monocular depth estimators the "correct" way: by unprojecting the predicted depth map into 3D points and viewing them from novel views.
Marigold performs the best but the human reconstructions still appear a bit "slanted".
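For readers curious what "unprojecting" means concretely, here is a minimal sketch of standard pinhole back-projection, not the evaluation code behind that comparison; the intrinsics (fx, fy, cx, cy) are assumed to be known for each image.

```python
import numpy as np

def unproject_depth(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift a predicted depth map (H, W) into an (H*W, 3) point cloud using the
    pinhole camera model; viewing the cloud from a novel angle makes "slanted"
    reconstructions obvious."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx  # back-project columns
    y = (v - cy) * depth / fy  # back-project rows
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```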
In robotics, the data problem is always much bigger than the model problem. Glad to see that AI models trained on other data sources can be applied to augment the precious data we have for robotics!
Text-to-image generative models, meet robotics!
We present ROSIE: Scaling RObot Learning with Semantically Imagined Experience, where we augment real robotics data with semantically imagined scenarios for downstream manipulation learning.
Website:
🧵👇
SayCan, a recent work, has shown that affordance functions can be used to allow LLM planners to understand what a robot can do from the observed *state*. However, SayCan did not provide scene-scale affordance grounding, and thus cannot reason over what a robot can do in a *scene*.
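As a rough illustration of that affordance-grounding idea (a sketch of the general SayCan-style recipe, not the actual SayCan codebase; the function and variable names are made up): each candidate skill is ranked by combining the LLM planner's preference with an affordance estimate of whether the skill can succeed from the current state.

```python
def select_skill(llm_scores: dict[str, float], affordances: dict[str, float]) -> str:
    """Pick the next skill: LLM preference ("say") times affordance from the current state ("can")."""
    combined = {skill: llm_scores[skill] * affordances.get(skill, 0.0) for skill in llm_scores}
    return max(combined, key=combined.get)

# Toy example: the LLM prefers "pick up sponge" and the affordance model agrees it is feasible.
print(select_skill({"pick up sponge": 0.7, "open fridge": 0.3},
                   {"pick up sponge": 0.9, "open fridge": 0.1}))
```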
I am at
@NeurIPSConf
2023 to present my reinforcement learning paper “Self-Supervised Reinforcement Learning that Transfers using Random Features”!
Let’s catch up everyone!
NLMap builds a natural-language-queryable scene representation with Vision-Language Models. An LLM-based object proposal module infers the objects involved and queries the representation for their availability and location. The LLM planner then plans conditioned on this information.
4/6
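A hedged sketch of that three-stage flow (object proposal → scene query → conditioned planning); the helper names here are illustrative placeholders, not NLMap's actual interfaces.

```python
def ground_and_plan(llm, scene_map, instruction: str) -> str:
    """Illustrative NLMap-style loop: propose objects, query the scene, then plan."""
    # 1) LLM-based object proposal: which objects might the task involve?
    proposed = llm(f"List the objects involved in: {instruction}").split(", ")
    # 2) Query the open-vocabulary scene representation for availability and location.
    grounding = {obj: scene_map.query(obj) for obj in proposed}  # e.g. {"apple": (x, y, z)} or None
    # 3) Plan conditioned on what actually exists in the scene and where it is.
    context = "; ".join(f"{k} at {v}" if v else f"{k} not found" for k, v in grounding.items())
    return llm(f"Scene: {context}\nTask: {instruction}\nStep-by-step plan:")
```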
My colleague Lirui is presenting his famous GenSim (avg review score of 8) at ICLR 2024 - check it out if you are at the conference! My own paper DittoGym was also accepted to ICLR, but unfortunately I cannot attend due to a visa issue - our friends will be presenting it there!
Excited to share that my works on fleet learning, GenSim and PoCo, will be presented at
#ICLR
in Vienna this week! This is my first ML conference in Europe. Look forward to meeting old and new friends in person!
@KevinKaichuang
@MIT
@MIT_CSAIL
Exactly why we need a boba shop on campus. Check out this too - Berkeley literally built their engineering building around a boba shop
We combine NLMap with SayCan to show the new robotic capabilities NLMap enables in a real office kitchen. NLMap frees SayCan from a fixed list of objects, locations, or executable options. We show 35 tasks that cannot be achieved by SayCan alone but are enabled by NLMap.
5/6
@GoogleDeepMind
Is this gap due to the model itself or lack of relevant data? Our work on Spatial VLM answers this question by lifting 10M images into 3D object-centric point clouds in **metric scale**. We then synthesize 2B spatial reasoning Q&A pairs to finetune a multi-modal LLM.
(3/6)
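To make the data-synthesis idea concrete, here is a toy sketch of the kind of templated Q&A one could generate from metric-scale, object-centric point clouds; the relation definitions and question templates are my illustrative guesses, not the paper's exact pipeline.

```python
import numpy as np

def synthesize_spatial_qa(objects: dict[str, np.ndarray]) -> list[tuple[str, str]]:
    """Turn per-object point clouds (name -> (N, 3) points, in meters) into spatial Q&A pairs."""
    centers = {name: pts.mean(axis=0) for name, pts in objects.items()}  # object centroids
    names = list(centers)
    qa = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist = np.linalg.norm(centers[a] - centers[b])  # metric distance between centroids
            qa.append((f"How far is the {a} from the {b}?", f"Roughly {dist:.2f} meters."))
            rel = "to the left of" if centers[a][0] < centers[b][0] else "to the right of"
            qa.append((f"Is the {a} left or right of the {b}?", f"The {a} is {rel} the {b}."))
    return qa
```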
@GoogleDeepMind
It may surprise you that some of the best multi-modal LLMs struggle with basic spatial concepts like "left-right" or "above-below" in images. This is a stark contrast to humans, who have been intuitively navigating and reasoning about the 3D world since ancient times. (2/6)
@GoogleDeepMind
In fact, it's not just about qualitative relationships - it's about understanding qualitative relationships as well, be it distance or size, much like humans. When integrated with CoT, it also unlocks new potential in embodied AI such as a dense reward annotator for RL.
(5/6)
@hausman_k
Hmm, isn’t this still very high-level? The LLM definitely knows things like the robot needing to avoid collisions, but not how. Would love to see VLMs make breakthroughs in low-level policies.
@GoogleDeepMind
By scaling up on such synthetic data, Spatial VLM achieves state-of-the-art performance in 3D spatial Q&A and multi-step spatial reasoning compared to the latest multi-modal LLMs, along with competitive VQA improvements and many other tasks we show in the paper.
(4/6)
Introducing ALOHA 🏖: 𝐀 𝐋ow-cost 𝐎pen-source 𝐇𝐀rdware System for Bimanual Teleoperation
After 8 months of iterating at
@stanford
and 2 months working with beta users, we are finally ready to release it!
Here is what ALOHA is capable of:
@yukez
🥺Cannot believe that the paper didn’t even mention NLMap (Open-Vocabulary Queryable Scene Representations for Real-World Planning). We are an early work cited by most of the papers in Section 4.C, and the section title is almost our paper's name.
Congratulations! I’ve been following your excellent work in 3D generation. Hope you and your team will eventually get the financial return your talent deserves!
📢Thrilled to announce sudoAI (
@sudoAI_
), founded by a group of leading AI talents and me!🚀
We are dedicated to revolutionizing digital & physical realms by crafting interactive AI-generated 3D environments!
Join our 3D Gen AI model waitlist today!
👉
@audrow
I think manufacturing represents an engineering challenge, but it’s nothing comparable to the challenge of embodied intelligence itself. The current definition of AGI seems narrowly focused on modalities where data is plentiful, while robot action data is nowhere close.
@peteflorence
Interesting question. If demonstrations are indeed diverse enough, i.e. with different objects and scenes, you should already see a lot of generalization from 10M demos
@chichengcc
@chris_j_paxton
Hard to tell from an accelerated video. My feeling is that the grasp and motion planning algorithms from the PR2 days can do this just fine. Would love to hear your opinion on the Figure robot demo
@Vardhan_Dongre
LM-Nav and VLMaps will be mentioned along with NLMap-Saycan. Still thinking about CLIP-Fields, as I wonder whether it will fit into the theme of "foundation models"