Can we use wearable devices to collect robot data without actual robots?
Yes! With a pair of gloves🧤!
Introducing DexCap, a portable hand motion capture system that collects 3D data (point cloud + finger motion) for training robots with dexterous hands
Everything open-sourced
How to teach robots to perform long-horizon tasks efficiently and robustly🦾?
Introducing MimicPlay - an imitation learning algorithm that uses "cheap human play data". Our approach unlocks both real-time planning through raw perception and strong robustness to disturbances!🧵👇
How to chain multiple dexterous skills to tackle complex long-horizon manipulation tasks?
Imagine retrieving a LEGO block from a pile, rotating it in-hand, and inserting it at the desired location to build a structure.
Introducing our new work - Sequential Dexterity 🧵👇
Can robots learn hand-eye coordination simply from teleoperated human demonstrations? Our new
#IROS2021
paper presents a novel action space to enable this!
Website:
1/9
1/ Can we improve the generalization capability of a vision-based task planner with representation pretraining?
Check out our RAL paper on learning to plan with pre-trained object-level representation.
Website:
The combination of LLMs and VLMs shows great potential in grounding “Where” and “How” to act in 3D observation space. Such capability allows the robot to perform visuomotor manipulation in a zero-shot fashion! Check out VoxPoser, amazing work led by
@wenlong_huang
at
@StanfordSVL
How to harness foundation models for *generalization in the wild* in robot manipulation?
Introducing VoxPoser: use LLM+VLM to label affordances and constraints directly in 3D perceptual space for zero-shot robot manipulation in the real world!
🌐
🧵👇
Introducing 𝐌𝐨𝐛𝐢𝐥𝐞 𝐀𝐋𝐎𝐇𝐀🏄 -- Learning!
With 50 demos, our robot can autonomously complete complex mobile manipulation tasks:
- cook and serve shrimp🦐
- call and take elevator🛗
- store a 3 lbs pot in a two-door cabinet
Open-sourced!
Co-led
@tonyzzhao
,
@chelseabfinn
Finally, let's talk about the learned low-level bimanual manipulation.
All behaviors are driven by neural network visuomotor transformer policies, mapping pixels directly to actions. These networks take in onboard images at 10 Hz and generate 24-DOF actions (wrist poses and…
Motion capture gloves, unlike vision-based tracking, are not affected by occlusions during hand-object interactions, making them perfect for mocap in daily activities. With an RGB-D camera, DexCap reconstructs 3D scenes and aligns motion data, all powered by a mini-PC in the backpack. 2/
We then retarget the mocap data to the robot embodiment. This includes (1) Observation retargeting by switching the camera system from human to robot. (2) Action retargeting by matching fingertip positions with IK. (3) Bridging the visual gap by including robot point clouds. 3/
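To make step (2) concrete, here is a minimal sketch of fingertip-position matching via damped least-squares IK, on a hypothetical 2-joint planar finger. The link lengths, starting pose, and solver settings are illustrative assumptions; the real system solves this against the robot hand's full kinematics.

```python
import numpy as np

def fingertip_fk(q, link_lengths=(0.04, 0.03)):
    """Forward kinematics of a toy 2-joint planar finger (meters)."""
    l1, l2 = link_lengths
    return np.array([
        l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
        l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1]),
    ])

def retarget_fingertip(target, q0, iters=100, damping=1e-6):
    """Damped least-squares IK: find joint angles whose fingertip
    matches the human fingertip position captured by the glove."""
    q = q0.astype(float).copy()
    eps = 1e-6
    for _ in range(iters):
        err = fingertip_fk(q) - target
        # numerical Jacobian of fingertip position w.r.t. the joints
        J = np.column_stack([
            (fingertip_fk(q + eps * np.eye(2)[i]) - fingertip_fk(q)) / eps
            for i in range(2)
        ])
        q -= np.linalg.solve(J.T @ J + damping * np.eye(2), J.T @ err)
    return q

# "human" fingertip position from mocap (here generated from known angles)
target = fingertip_fk(np.array([0.6, 0.4]))
q = retarget_fingertip(target, q0=np.array([0.3, 0.3]))
```

The damping term keeps the solve stable near kinematic singularities, which matters when the glove pose is briefly outside the robot hand's workspace.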
DexCap: a $3,600 open-source hardware stack that records human finger motions to train dexterous robot manipulation. It's like a very "lo-fi" version of Optimus, but affordable to academic researchers. This isn't teleoperation: data collection is decoupled from the robot…
We hope Sequential Dexterity paves the path for future research on long-horizon dexterous manipulation. Feel free to check out our code!
Website & Paper:
Code:
Work done w/ Yuanpei Chen,
@drfeifei
, and Karen Liu at
@StanfordAILab
.
🤖Joint-level control + portability = robot data in the wild! We present AirExo, a low-cost hardware system, and showcase how in-the-wild data enhances robot learning, even in contact-rich tasks. A promising tool for large-scale robot learning & TeleOP, now at !
Thanks, Jim, for the amazing summary of our key insight - the bi-directional optimization for skill chaining. Check out our code () and video () to see how we make it work!
This is "Sequential Dexterity", a neural network that controls a robot arm to build LEGO given a manual 🤖
To do this task, the robot needs to chain together multiple skills (grasping, re-orienting, pushing, etc.) and execute without compounding error.
I find some very simple…
The core of the system is a learning-based transition feasibility function that progressively fine-tunes the RL-learned sub-policies to improve chaining success. It can also be used during skill selection, to re-plan from failures and to bypass redundant stages.
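As a rough illustration of how a transition feasibility function can gate skill selection (not the paper's actual architecture), here is a toy sketch where a logistic model with hand-picked weights stands in for the learned network:

```python
import numpy as np

def feasibility(state, W, b):
    """Toy transition feasibility score: probability that a sub-policy
    can succeed when started from `state` (a logistic model standing
    in for the learned network)."""
    return 1.0 / (1.0 + np.exp(-(W @ state + b)))

def select_next_skill(state, skills, threshold=0.5):
    """Pick the highest-feasibility skill; signal re-planning if none
    clears the threshold (e.g., re-grasp instead of inserting)."""
    scores = {name: feasibility(state, W, b) for name, (W, b) in skills.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores  # trigger re-planning from failure
    return best, scores

# illustrative skills with hand-tuned weights (2-dim toy state)
skills = {
    "insert":   (np.array([ 2.0, -1.0]), -0.5),
    "reorient": (np.array([-1.0,  2.0]),  0.0),
}
skill, scores = select_next_skill(np.array([1.0, 0.2]), skills)
```

Returning `None` when every score falls below the threshold is what lets the executor re-plan from failures rather than blindly running the next stage.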
Introducing 𝐌𝐨𝐛𝐢𝐥𝐞 𝐀𝐋𝐎𝐇𝐀🏄 -- Hardware!
A low-cost, open-source, mobile manipulator.
One of the highest-effort projects of my past 5 years! Not possible without co-lead
@zipengfu
and
@chelseabfinn
.
In the end, what's better than cooking yourself a meal with the 🤖🧑🍳
We train a point cloud-based Diffusion Policy with retargeted human mocap data only. The robot controls both hands (46-dim action space) to perform tasks including collecting tennis balls🎾 and packaging objects🎁. All the policies are learned without any teleoperation data. 4/
However, DexCap is not yet ready for tasks that require applying force, as positional data alone is insufficient. Therefore, we extend DexCap with human-in-the-loop correction during rollouts. Within 30 trials of corrections, our robot can prepare tea🍵 and use scissors✂️. 7/
Using a domain definition language to set multi-task evaluation goals is very scalable and saves a lot of engineering effort! Kudos to
@yifengzhu_ut
and the team! If you missed it, we've released MimicPlay code and tested on LIBERO a while ago. Check out
We are thrilled to announce LIBERO, a lifelong robot learning benchmark to study knowledge transfer in decision-making and robotics at scale! 🤖 LIBERO paves the way for prototyping algorithms that allow robots to continually learn! More explanations and links are in the 🧵
How to acquire generalizable robotic skills has garnered much attention recently. We invite you to join our CoRL 2023 workshop - Towards Generalist Robots. Don't miss the chance to hear from our amazing speakers and share your insights through paper submission (Before Oct. 16)!
🤖How far are we from 𝐠𝐞𝐧𝐞𝐫𝐚𝐥𝐢𝐬𝐭 𝐫𝐨𝐛𝐨𝐭𝐬?
𝐀𝐧𝐧𝐨𝐮𝐧𝐜𝐢𝐧𝐠 the 1st Workshop on "Towards Generalist Robots" at
#CoRL2023
!
Join us to discuss how to scale up robotic skill learning, with an amazing lineup of speakers!
CfP:
Details 👇
Our workshop on "Overlooked Aspects of Imitation Learning" is happening this Monday June 27 from 11:00am - 12:30pm EST at
#RSS2022
. You're welcome to join us virtually or in person, and don't forget it is the EST time zone!
Another amazing work shows the tremendous potential of "collecting robot data without a robot". These portable low-cost systems really pave the way for scaling up data collection. Super looking forward to what further advancements will emerge.
Can we collect robot data without any robots?
Introducing Universal Manipulation Interface (UMI)
An open-source $400 system from
@Stanford
designed to democratize robot data collection
0 teleop -> autonomously wash dishes (precise), toss (dynamic), and fold clothes (bimanual)
We observe that human play data is fast and easy to collect, and it also covers diverse behaviors and situations. On the other hand, although robot data is slow and limited, it does not have an embodiment gap. MimicPlay is a method designed to combine the best of both worlds. (2/N)
Despite being trained only in simulation with a few task objects, our system demonstrates generalization capability to novel object shapes and is able to zero-shot transfer to a real-world robot equipped with a dexterous hand.
Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation
paper page:
Many real-world manipulation tasks consist of a series of subtasks that are significantly different from one another. Such long-horizon, complex tasks highlight…
I've been waiting for this thread for a long time! Fantastic work by
@tonyzzhao
showing the great potential of this bimanual teleoperation system. The conditional action synthesis makes a lot of sense in handling human demos for long-horizon tasks!
Introducing ALOHA 🏖: 𝐀 𝐋ow-cost 𝐎pen-source 𝐇𝐀rdware System for Bimanual Teleoperation
After 8 months of iterating at
@stanford
and 2 months working with beta users, we are finally ready to release it!
Here is what ALOHA is capable of:
DexCap is fully portable and can scale up data collection in the wild. By collecting data with multiple objects in diverse environments, the learned policy can generalize to unseen objects for the same task. 5/
MimicPlay is a hierarchical imitation learning algorithm that leverages cheap and non-labeled human play data (10 minutes) for learning the high-level planner and a small amount of robot data (20 demonstrations ~ 20 minutes) for learning a plan-guided low-level controller. (3/N)
Just dropped sim_web_visualizer! 🚀 Transform the way you view simulation environments, right in your web browser (e.g., Chrome).
Dive into more examples on our Github:
Great to see such natural power grasping and tool-use motions, and super smooth! Now, robots can even play video games with a joystick controller 🤣 Amazing results highlighting the strength of vision+tactile. Congrats
@ToruO_O
!
Imitation learning works™ – but you need good data 🥹 How to get high-quality visuotactile demos from a bimanual robot with multifingered hands, and learn smooth policies?
Check our new work “Learning Visuotactile Skills with Two Multifingered Hands”! 🙌
DexCap enables fast data collection, approximating the speed of natural human motion. Moreover, the collection process does not require costly robot hardware. 6/
Pretty cool results with diffusion model + visuomotor imitation learning! It is always painful to learn from multi-modal demos. Seems like iterative diffusion policy is a promising direction! Congrats
@chichengcc
@SongShuran
What if the form of visuomotor policy has been the bottleneck for robotic manipulation all along? Diffusion Policy achieves 46.9% improvement vs. prior SoTA on 11 tasks from 4 benchmarks + 4 real-world tasks! (1/7)
website :
paper:
Introducing 𝐀𝐋𝐎𝐇𝐀 𝐔𝐧𝐥𝐞𝐚𝐬𝐡𝐞𝐝 🌋 - Pushing the boundaries of dexterity with low-cost robots and AI.
@GoogleDeepMind
Finally got to share some videos after a few months. Robots are fully autonomous, filmed in one continuous shot. Enjoy!
Considering how fragile the Allegro hand is, the sample efficiency of the online learning is truly amazing! Plier cutting is a cool result highlighting the potential of dexterous hands using human tools.
We just released TAVI -- a robotics framework that combines touch and vision to solve challenging dexterous tasks in under 1 hour.
The key? Use human demonstrations to initialize a policy, followed by tactile-based online learning with vision-based rewards.
Details in🧵(1/7)
We especially thank
@kenny__shaw
@anag004
@pathak2206
for open-sourcing the LEAP Hand project. Having a customizable and low-cost dexterous hand benefits our project a lot!
We found MimicPlay significantly outperforms prior methods in performance and sample efficiency. With only 20 robot demonstrations and a planner learned with 10 minutes of human play data (shared across tasks), MimicPlay can perform long-horizon tasks such as baking food. (6/N)
Visual correspondence paves the way for numerous downstream tasks. Learning from internet-scale video further reveals its full strength! Very interested in trying it out for manipulation and more! Nice work from
@agrimgupta92
🚀
How should we leverage internet videos for learning visual correspondence?
In our latest work we introduce SiamMAE: Siamese Masked Autoencoders for self-supervised representation learning from videos.
web:
paper: 👇🧵
Huge congrats to
@DrJimFan
and
@yukez
on the exciting move! Really enjoyed working with you last year and can't wait to see what this team will achieve!🦾
Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof.
@yukez
. GEAR stands for Generalist Embodied Agent Research.
We believe in a future where every machine that moves will be autonomous, and robots and…
I’ll present our work on learning human-robot collaboration in simulation today
@corl_conf
(Wed 5:00-6:00pm GMT, 9:00-10:00am PST). Drop by the poster (Session V Booth 4) and check out our paper () with code () to learn more!
robomimic v0.3 released - the most major upgrade yet! New features:
🧠 New Algorithms (BC-Transformer, IQL)
🤖 Full compatibility with robosuite v1.4 and
@DeepMind
's MuJoCo bindings
👁️ Pre-trained image reps
📈 wandb logging
@weights_biases
try it out:
We further test MimicPlay in more challenging multi-task learning settings, where we found that MimicPlay has the smallest performance drop compared to prior methods. This result highlights MimicPlay's capability to handle diverse tasks within one model. (8/N)
More importantly, after training multiple tasks within one model, MimicPlay is able to generalize to new tasks with unseen temporal compositions. (7/N)
🦾 Our robot hand can rotate objects over 6+ axes in the real world!
Introducing RotateIt (CoRL 2023), a Sim-to-Real policy that can rotate many objects over many axes, using vision and touch!
Check it out: .
Paper: .
#CoRL2023
Introducing Open-World Mobile Manipulation 🦾🌍
– A full-stack approach for operating articulated objects in open-ended unstructured environments:
Unlocking doors with lever handles/ round knobs/ spring-loaded hinges 🔓🚪
Opening cabinets, drawers, and refrigerators 🗄️
👇…
The high-level planner is first trained as a goal-conditioned policy. It takes the current and goal images from human play data and outputs a latent plan. We also use a KL-loss to minimize the visual gap between human and robot data. (4/N)
OpenAI ChatGPT is excellent at creating PyBullet scripts:
"can you create a pybullet script with a ground plane, a box and a sphere on top."+"can you add 10 more boxes on top?"+"can you move the 5th box 0.3 units along the x axis?"+"can you add a quadruped robot next to the boxes?"
In the second step, we freeze the weights of the trained latent planner. The latent planner takes the current and goal images from the robot data and generates a latent plan to train the low-level controller with a plan-guided imitation learning algorithm. (5/N)
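For intuition on the KL term that aligns human-play and robot latent plans (from step 4/N), here is its closed form for diagonal Gaussians. The 8-dim latent and the Gaussian parameterization are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians --
    the kind of term used to pull human-play and robot latent plans
    toward a shared distribution."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# hypothetical latent-plan distributions from the two encoders (8-dim)
mu_human, logvar_human = np.zeros(8), np.zeros(8)
mu_robot, logvar_robot = np.full(8, 0.5), np.zeros(8)

gap = kl_diag_gaussians(mu_human, logvar_human, mu_robot, logvar_robot)
```

Minimizing this gap during planner training is what lets the frozen planner later produce usable latent plans from robot images despite the visual gap.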
@Stone_Tao
Great question! The primary issue at hand is whether we have sufficient data to support end-to-end learning. This becomes particularly challenging for tasks with longer horizons, as the accumulation of errors requires an abundance of data to cover diverse scenarios.
Imitation learning for real-world problems takes more than new algorithms - check out our RSS22 workshop on Overlooked Aspects of Imitation Learning - consider submitting by May 7th!
Applying imitation learning to real-world problems takes more than new algorithms. We are organizing a workshop "Overlooked Aspects of Imitation Learning: Systems, Data, Tasks, and Beyond” at RSS22! Exciting speakers & more to come. Submit by May 7th!
We trained a transformer called VIMA that ingests *multimodal* prompts and outputs controls for a robot arm. A single agent is able to solve visual goal, one-shot imitation from video, novel concept grounding, visual constraint, etc. Strong scaling with model capacity and data!🧵
We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. 🧵👇
How can robot manipulators perform in-home tasks such as making coffee for us? We introduce VIOLA, an imitation learning model for end-to-end visuomotor policies that leverages object-centric priors to learn from only 50 demonstrations!
Humanoids with legs and anthropomorphic hands will never beat this robot in cost, speed, maintenance, or reliability in the multi-trillion-dollar warehouse/manufacturing market, where climbing stairs is unnecessary.
Customers will choose faster, lower cost & more reliable every time.
Humans learn to collaborate with others through experiences. However, it would take countless human hours to teach robots how to collaborate through trial-and-error. We ask: is it possible to teach human-robot collaboration skills through human-human collaboration demos?
3/8
Having a robot assistant at home that could seamlessly assist us with daily activities is a long-sought dream. To achieve this goal, the robot needs to recognize and react to its human partners' intentions on the fly.
2/8
In this work, we take a step forward to improve the generalization capability of traditional imitation learning algorithms by introducing a novel human-like hand-eye coordination action space. We hope it can inspire future studies on generalizable offline policy learning.
8/9
@kevin_zakka
Great question! We tried using human demos as task video prompts, which works for short-horizon subgoals such as opening an oven or turning off a lamp. However, it encountered difficulty with longer-horizon tasks due to a mismatch in motion speed between human and robot.
@hameleoned
Awesome catch! We find lots of cases where the human hand and the robot gripper need to be used in very different ways to accomplish the same task (e.g., opening an oven). Such an embodiment gap between human and robot makes it hard to learn low-level control solely from human data.
@AndreTI
@olivercameron
Great point! We are paying close attention to human/robot motion synthesis with diffusion. The key challenge is how to ground such generated behavior to the physical world/objects. Let's say the hand needs to reach the 3D location of a door handle in the correct pose to open it.
6/ With the pre-trained representation, a state transition model is learned to predict the skill effects. The key insight is: the representation summarizes the common features of the objects from the same category (e.g. foods are cookable), which could transfer to new instances.
@allenzren
Thanks for bringing this up! Q_H and Q_R represent the feature embedding space produced by the image encoders. Yes, we treat all the data as a single distribution, and the human play data we utilized does not have any task labeling.
@allenzren
The KL is trying to minimize the distribution gap between human and robot images. I like your idea of dividing the distribution by clustering the data with 3D trajectories and employing KL based on the clustered features. This seems reasonable and might enhance the results!
7/ Finally, we can use a simple tree search algorithm to find the task plans for different symbolic goals. Experiments show that the task planner based on the pre-trained representation could successfully transfer to new objects and new scene layouts.
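A toy sketch of the idea in 6/ and 7/: a symbolic transition model (standing in here for the learned state-transition model that predicts skill effects) plus breadth-first search over those predicted effects. The skills and predicates below are made up for illustration.

```python
from collections import deque

def apply_skill(state, skill):
    """Predict the effect of a skill: (preconditions, add, delete)
    over a frozenset of symbolic predicates."""
    pre, add, delete = skill
    if not pre <= state:
        return None  # preconditions not met in this state
    return (state - delete) | add

def plan(start, goal, skills):
    """Breadth-first search over predicted skill effects, returning the
    shortest skill sequence whose final state satisfies the goal."""
    start = frozenset(start)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, seq = queue.popleft()
        if frozenset(goal) <= state:
            return seq
        for name, skill in skills.items():
            nxt = apply_skill(state, skill)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + [name]))
    return None

# illustrative cooking domain: (preconditions, add effects, delete effects)
skills = {
    "pick":  (frozenset({"on_table"}), frozenset({"in_hand"}), frozenset({"on_table"})),
    "place": (frozenset({"in_hand"}),  frozenset({"in_pot"}),  frozenset({"in_hand"})),
    "cook":  (frozenset({"in_pot"}),   frozenset({"cooked"}),  frozenset()),
}
seq = plan({"on_table"}, {"cooked"}, skills)
```

Because the transition model generalizes across object instances with the pre-trained representation, the same search can be reused on new objects and layouts without re-engineering the domain.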