We introduce W.A.L.T, a diffusion model for photorealistic video generation. Our model is a transformer trained on image and video generation in a shared latent space. 🧵👇
How should we leverage internet videos for learning visual correspondence?
In our latest work we introduce SiamMAE: Siamese Masked Autoencoders for self-supervised representation learning from videos.
web:
paper: 👇🧵
We have released the LVIS v0.5 dataset for long tail object detection with 1200+ categories and 700k+ high-quality instance segmentation masks
Paper:
Website:
API:
with Ross Girshick and Piotr Dollar
@facebookai
1/ Can we replicate the success of large-scale pre-training --> task-specific fine-tuning for robotics?
This is hard as robots have different action/observation spaces, morphologies, and learning speeds!
We introduce MetaMorph🧵👇
Paper:
Code:
Excited to share our work on understanding the relationship between environmental complexity, evolved morphology, and the learnability of intelligent control.
Paper:
Video:
w/
@silviocinguetta
@SuryaGanguli
@drfeifei
1/ Can we build video prediction models by masked visual pretraining via Transformer?
We present MaskViT: a simple & parameter-efficient method to generate high-res videos in real time.
Paper:
Web: 🧵👇
1/ Excited to share that our work on Deep Evolutionary Reinforcement Learning (DERL), a framework for large-scale evolution of embodied agents in physically realistic environments, is now published in
@NatureComms
Paper
Video
First LVIS Challenge @ ICCV 2019 is now live. We have updated the paper to include baselines and analysis.
Paper:
Challenge:
with Ross Girshick and Piotr Dollar
@facebookai
Foundation models can dexterously manipulate the world of bits but what about the world of atoms?
Excited to introduce 🤖RoboCat🐈, the first foundation agent:
✅ multi-embodiment
✅ self-improves
✅ vision to action
✅ dexterous & generalist: 100s of tasks + objects
How? 👇🧵
Introducing RoboCat, a new AI model designed to operate multiple robots. 🤖
It learns to solve new tasks on different robotic arms with as few as 100 demonstrations - and improves skills from self-generated training data.
Find out more:
Datasets are extremely hard to get right and often underappreciated. I can only try to imagine the tremendous foresight and hard work which went into making ImageNet. It's mind-boggling that
@drfeifei
was able to envision where the community should go as early as 2006-07!
#CVPR19
2023 was a breakout year for AI video.
In January, there were no public text-to-video models. Now, there are dozens of video gen products and millions of users.
A recap of the biggest developments + companies to watch 👇
3/ Second, for memory and training efficiency, we use a window attention based transformer architecture for joint spatial and temporal generative modeling in latent space.
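To make the window-attention idea concrete, here is a minimal sketch of attention restricted to non-overlapping windows over a (T, H, W, C) grid of video latents. The window sizes and the single-head, projection-free attention are illustrative assumptions, not W.A.L.T's actual configuration:

```python
import torch
import torch.nn.functional as F

def window_attention(x, window):
    """x: (T, H, W, C) latent tokens; window: (wt, wh, ww).
    Full self-attention runs independently inside each window."""
    T, H, W, C = x.shape
    wt, wh, ww = window
    # Partition the token grid into non-overlapping 3D windows.
    x = x.view(T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, wt * wh * ww, C)
    # Single-head attention per window (q/k/v projections omitted for brevity).
    attn = F.softmax(x @ x.transpose(1, 2) / C ** 0.5, dim=-1)
    return attn @ x  # (num_windows, window_len, C)

latents = torch.randn(4, 8, 8, 16)                     # (T, H, W, C) video latents
spatial = window_attention(latents, (1, 8, 8))         # attend within each frame
spatiotemporal = window_attention(latents, (4, 4, 4))  # attend in local 3D windows
```

Restricting attention to windows makes the cost scale with the window volume rather than the full T·H·W sequence length, which is where the memory and training efficiency comes from.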
Photorealistic Video Generation with Diffusion Models
paper:
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly…
We are excited to announce the LVIS 2021 challenge
@ICCV_2021
. This year we introduce new metrics to better measure progress made by our algorithms in the challenging regime of long tail object recognition. Check out the challenge hosted on
@eval_ai
The LVIS 2021 challenge is live! It uses our
#dataset
that contains 1203 object categories, 160k images, and 2M instance annotations. The deadline to submit your challenge entry is September 27. Learn more about LVIS and the challenge here:
LLMs are extremely powerful and are very good at writing code. However, they lack visual grounding. Exciting work led by
@wenlong_huang
shows how we can combine LLMs + VLMs for robotic manipulation.
How to harness foundation models for *generalization in the wild* in robot manipulation?
Introducing VoxPoser: use LLM+VLM to label affordances and constraints directly in 3D perceptual space for zero-shot robot manipulation in the real world!
🌐
🧵👇
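A toy sketch of the composed value map idea: affordances labeled by the LLM/VLM lower the cost near a target, constraints raise it inside a region to avoid, and the next waypoint is the lowest-cost voxel. Grid size, weights, and the greedy selection are illustrative choices, not VoxPoser's implementation:

```python
import numpy as np

grid = np.zeros((20, 20, 20))                        # workspace voxel value map

def add_affordance(center, weight=1.0):
    """Lower cost near a point the LLM/VLM labeled as a target."""
    idx = np.indices(grid.shape).transpose(1, 2, 3, 0)
    grid[...] += weight * np.linalg.norm(idx - center, axis=-1)

def add_constraint(center, radius=3.0, penalty=100.0):
    """High cost inside a region labeled as something to avoid."""
    idx = np.indices(grid.shape).transpose(1, 2, 3, 0)
    grid[np.linalg.norm(idx - center, axis=-1) < radius] += penalty

add_affordance(np.array([15, 15, 5]))                # e.g. "the drawer handle"
add_constraint(np.array([10, 10, 5]))                # e.g. "stay away from the vase"
waypoint = np.unravel_index(grid.argmin(), grid.shape)
print(waypoint)                                      # lowest-cost voxel -> next end-effector target
```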
Latest project from friends at
@NVIDIAAI
: Voyager. An AI agent based on GPT-4 that plays Minecraft and keeps learning new skills. Congrats
@guanzhi_wang
,
@DrJimFan
and the whole team!
What if we set GPT-4 free in Minecraft? ⛏️
I’m excited to announce Voyager, the first lifelong learning agent that plays Minecraft purely in-context. Voyager continuously improves itself by writing, refining, committing, and retrieving *code* from a skill library.
GPT-4 unlocks…
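A toy sketch of the skill-library idea: store generated code snippets keyed by a description, retrieve the most relevant ones for a new task. The bag-of-words similarity is a stand-in for a real text-embedding model, and all skills here are hypothetical:

```python
from collections import Counter
import math

library = {}  # skill name -> {"code": str, "bow": Counter}

def bow(text):
    return Counter(text.lower().split())

def add_skill(name, description, code):
    library[name] = {"code": code, "bow": bow(description)}

def retrieve(task, k=1):
    """Rank stored skills by cosine similarity of bag-of-words vectors."""
    q = bow(task)
    def cos(a, b):
        dot = sum(a[w] * b[w] for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm + 1e-9)
    return sorted(library, key=lambda n: cos(q, library[n]["bow"]), reverse=True)[:k]

add_skill("mine_wood", "chop a tree to collect wood logs", "def mine_wood(bot): ...")
add_skill("craft_table", "craft a crafting table from wood planks", "def craft_table(bot): ...")
print(retrieve("collect wood from a tree"))  # ['mine_wood']
```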
Great use of AI for video editing. Dynamic ad creation using
@Rephrase_AI
+ hyper-local targeting based on pincodes of users = having one of the most popular Bollywood stars
@iamsrk
as a brand ambassador for your local business!
Video:
First up:
@agrimgupta92
from Stanford is interested in how the form of a machine changes its ability to learn, shifting the focus away from learning algorithms operating by themselves and onto learning combined with a kind of bodily evolution.
#EmTechDigital
2/ MetaMorph is based on the insight that robot morphology is just another modality on which we can condition the output of a Transformer.
We process an arbitrary robot by creating a 1D sequence of tokens corresponding to a depth-first traversal of its kinematic tree.
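A minimal sketch of this tokenization, with a made-up kinematic tree and made-up per-limb features; one token is emitted per limb in depth-first order:

```python
def tokenize(limb, tokens=None):
    """limb: {'features': [...], 'children': [...]} kinematic-tree node."""
    if tokens is None:
        tokens = []
    tokens.append(limb["features"])          # one token per limb, in DFS order
    for child in limb["children"]:
        tokenize(child, tokens)
    return tokens

robot = {
    "features": [0.3, 0.1],                  # e.g. limb length / radius (hypothetical)
    "children": [
        {"features": [0.2, 0.05], "children": []},
        {"features": [0.25, 0.05], "children": [
            {"features": [0.1, 0.03], "children": []},
        ]},
    ],
}
print(tokenize(robot))  # [[0.3, 0.1], [0.2, 0.05], [0.25, 0.05], [0.1, 0.03]]
```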
People say datasets are opium for AI researchers. I like the original analogy better. Data is Oil. Both are unsustainable in the long run but nothing else works right now.
Came across this discussion from 1984: More than 35 years have passed but it still reads the same if you just replace expert systems with deep learning.
VQA (V+L) was subject to much debate after Alyosha's talk in
#CVPR2019
. Even though I agree with the sentiment that vision is just not there yet, I don't think discarding an entire field is wise. The progress the VQA benchmark has enabled is undeniable.
VQA performance on a standard benchmark (VQA v2 dataset) has gone up 20% (absolute) in the last ~4 years. You can really tell the difference when interacting with these models! Check the VQA demo out here:
5/ By predicting a majority fraction of the future frame, SiamMAE learns the notion of object boundaries. This emergent ability is unique and surprising as no loss function operates on the [CLS] token in SiamMAE. We're excited to explore this further!
3/ Large-scale simulations allow us to ask interesting scientific questions, like: what is the relationship between morphological intelligence and environmental complexity? We find that agents evolved in more complex environments are able to learn new tasks faster and better.
2/ First, we note that images are (approximately) isotropic. However, the temporal dimension is special and not all spatio-temporal orientations are equally likely. Hence, symmetric masking across the temporal dimension might be sub-optimal!
pycls is a high-quality, high-performance codebase for image classification research. It can also serve as a great starting point for projects not necessarily on image classification.
Code:
by
@facebookai
1/2 We released LVIS v1.0 dataset for long tail object detection with 1200+ categories and 2M+ high quality instance seg masks on 160k images
Paper:
API:
Website:
with Ross and Piotr
#CVPR2020
@facebookai
Task specification is a challenging problem in robotics. We introduce VIMA: a transformer-based model which can perform *any* task as specified by multimodal prompts. An intuitive & multimodal task interface is going to be essential for useful embodied agents.
We trained a transformer called VIMA that ingests *multimodal* prompt and outputs controls for a robot arm. A single agent is able to solve visual goal, one-shot imitation from video, novel concept grounding, visual constraint, etc. Strong scaling with model capacity and data!🧵
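A toy sketch of what one interleaved multimodal prompt can look like: text tokens and image crops embedded into a shared space and concatenated into a single sequence. The embedders, dimensions, and patch format are placeholders, not VIMA's actual tokenizers:

```python
import torch
import torch.nn as nn

d = 64
embed_word = nn.Embedding(1000, d)                   # hypothetical text vocab
embed_image = nn.Linear(16 * 16 * 3, d)              # hypothetical image-crop encoder

def build_prompt(segments):
    """segments: list of ('text', LongTensor ids) or ('image', (N,16,16,3) crops)."""
    parts = []
    for kind, value in segments:
        if kind == "text":
            parts.append(embed_word(value))
        else:
            parts.append(embed_image(value.flatten(1)))
    return torch.cat(parts)                          # one sequence for the transformer

prompt = build_prompt([
    ("text", torch.tensor([5, 17, 3])),              # e.g. "put the ..."
    ("image", torch.randn(1, 16, 16, 3)),            # reference object crop
    ("text", torch.tensor([9, 2])),                  # "... into the bowl"
])
print(prompt.shape)  # torch.Size([6, 64]) -> fed to the policy transformer
```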
How to chain multiple dexterous skills to tackle complex long-horizon manipulation tasks?
Imagine retrieving a LEGO block from a pile, rotating it in-hand, and inserting it at the desired location to build a structure.
Introducing our new work - Sequential Dexterity 🧵👇
Evaluation of modern generative models is challenging. Check out HEIM: amazing work led by
@tonyh_lee
@michiyasunaga
@chenlin_meng
. A new benchmark for evaluating text to image generation models 🧵👇
Text-to-image models like DALL-E create stunning images. Their widespread use calls for transparent evaluation of their capabilities and risks.
📣 We introduce HEIM: a benchmark for holistic evaluation of text-to-image models
(in
#NeurIPS2023
Datasets)
[1/n]
3/ We randomly select a pair of frames from videos and use an asymmetric masking strategy: mask a high portion of the future frame (95%) and keep the past frame intact (0%). Frames are processed independently via an encoder and future masked patches are predicted via a decoder.
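A toy version of that asymmetric masking step, using the ratios from the tweet (past frame fully visible, ~95% of future patches hidden); the patch count and embedding size are assumptions, not SiamMAE's exact setup:

```python
import torch

def asymmetric_mask(past, future, mask_ratio=0.95):
    """past, future: (N, C) patch tokens for the two sampled frames."""
    n = future.shape[0]
    keep = int(n * (1 - mask_ratio))
    perm = torch.randperm(n)
    visible_idx = perm[:keep]                # ~5% of future patches stay visible
    masked_idx = perm[keep:]                 # ~95% must be predicted by the decoder
    return past, future[visible_idx], masked_idx

past = torch.randn(196, 768)                 # 14x14 patches, ViT-B dim (assumed)
future = torch.randn(196, 768)
past_tokens, future_visible, to_predict = asymmetric_mask(past, future)
print(future_visible.shape, to_predict.shape)  # ~9 visible, ~187 to predict
```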
4/ Moreover, we observe a morphological Baldwin effect, where morphologies rapidly evolve over a few generations to reduce the sample complexity of reinforcement learning, cutting learning times in half in just 10 generations!
5/ We find a mechanistic underpinning for both the morphological Baldwin effect and the emergence of embodied intelligence. DERL finds morphologies which are energy efficient and highly stable which affords the agents the ability to not only survive in their...
People often want to work with the best people in a field. I think if life is like a race then it makes sense to work with people who are just ahead of you; slipstreaming only happens when you are just behind.
4/ MAE features generally require finetuning and perform poorly in zero-shot settings. Asymmetric masking, a siamese encoder, and our decoder design fix this. SiamMAE features can be used zero-shot and outperform state-of-the-art self-supervised methods on multiple tasks.
4/ Thanks to iterative decoding, we can now use MaskViT for planning on real robots. In fact, our video prediction is up to 512x faster than autoregressive video prediction.
Came across the YouTube channel of
@Lux_Capital
. Some really cool startups are being backed by them. A key difference I noticed in their portfolio was absence of Uber for X or Amazon for Y type startups. Really refreshing.
This is great! Almost all computer vision datasets depend on Flickr. This trend, which started with PASCAL, was later followed by ImageNet, COCO, etc. Finally we will have something which is not only a different type of image distribution but hopefully also not too North America focused.
Big news! With help from the
@NSF
, some terrific colleagues (
@neuroMDL
@bjbalas
and Paul MacNeilage) and I are about to start a journey creating the Visual Experience Database: a first-person video database that will characterize how the world actually looks. 1/
1/ How do current advances in transformer architectures and representation learning transfer to the challenging setting of long tail instance segmentation?
Are we close to detecting 1200+ categories? Check out the LVIS Challenge Workshop video:
Super excited to announce the release of Stable Video Diffusion (SVD) -- the first set of video models in the Stable Diffusion series. To start with, we release 14-frame (SVD) and 25-frame image-to-video (SVD-XT) models. The code/weights are already out!
SVD:…
2/ Our approach is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. We find that this simple decoupling greatly improves the training speed without sacrificing quality.
@johnowhitaker
Non-autoregressive decoding has shown great results with mask-predict in NLP, MaskGIT in images, and MaskViT in videos. The key idea is iterative refinement!
2/ DERL closely mimics the intertwined processes of evolution and learning and creates embodied agents that exploit the passive physical dynamics of agent-environment interactions to survive in their evolutionary environment.
5/ Finally, we provide a mechanistic explanation of how MetaMorph is able to control 1000s of morphologies.
MetaMorph simplifies the control problem by learning to activate different motor synergies depending on the input morphology!
4/ Our pre-trained controller can zero-shot generalize to novel task and morphology combinations. Fine-tuning our pre-trained controller is up to 3x more sample efficient than training from scratch on novel tasks.
Are you looking for more realistic and challenging environments to train your robots? Would you one day like a robot which can free you from household chores? Wondering how current RL algorithms perform in this challenging setting? Come join us at
#ICCV2021
Join us this Sunday Oct 17 13-18 EDT @ BEHAVIOR Workshop: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments. We feature 7 world-renowned speakers in CV, embodied AI, and robotics:
@leto__jean
@chelseabfinn
@hyogweon
& more
1/ You don't need a PhD if you are interested in ML engineering roles. If you care about doing independent research, it is almost impossible now to get that freedom at a company without a PhD.
@chipro
Can we say once and for all that you DON'T need MS/PhD to do machine learning? If you're interested in a company, build up your portfolio and apply (or get people to refer you)! No tech company would pass up someone who has won Kaggle competitions or amazing GitHub repos.
What should we learn from videos to accelerate robot learning? Key idea: learn high-level planning from in-domain human videos and low-level skills from robot demonstrations. Really impressive results. Congrats
@chenwang_j
and team!
How to teach robots to perform long-horizon tasks efficiently and robustly🦾?
Introducing MimicPlay - an imitation learning algorithm that uses "cheap human play data". Our approach unlocks both real-time planning through raw perception and strong robustness to disturbances!🧵👇
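A toy sketch of the two-level interface this implies: a high-level planner over human play-video features emits a latent plan, and a low-level policy conditions on that plan plus robot observations. Both networks are untrained placeholders, and all shapes are assumptions:

```python
import torch
import torch.nn as nn

planner = nn.GRU(input_size=128, hidden_size=64, batch_first=True)            # from human video features
policy = nn.Sequential(nn.Linear(64 + 32, 64), nn.ReLU(), nn.Linear(64, 7))   # 7-DoF action head

video_feats = torch.randn(1, 30, 128)      # 30 frames of (hypothetical) visual features
_, latent_plan = planner(video_feats)      # high-level plan from cheap play data
robot_obs = torch.randn(1, 32)             # proprioception + goal features (assumed)
action = policy(torch.cat([latent_plan[0], robot_obs], dim=-1))
print(action.shape)  # torch.Size([1, 7])
```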
3/ Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function.
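A minimal sketch of iterative-refinement decoding with a mask schedule, in the MaskGIT/MaskViT spirit; the cosine schedule, step count, and confidence-based commitment are common choices rather than the paper's exact settings, and `model` is a stand-in for the trained transformer:

```python
import math
import torch

def iterative_decode(model, num_tokens, steps=12):
    tokens = torch.full((num_tokens,), -1)           # -1 marks a masked token
    for t in range(steps):
        probs = model(tokens).softmax(-1)            # predict all tokens at once
        conf, pred = probs.max(-1)
        pred = torch.where(tokens == -1, pred, tokens)  # never change committed tokens
        conf[tokens != -1] = float("inf")
        # Cosine schedule: fraction of tokens still masked after this step.
        frac_masked = math.cos(math.pi / 2 * (t + 1) / steps)
        n_keep = num_tokens - int(num_tokens * frac_masked)
        keep = conf.topk(n_keep).indices             # commit the most confident tokens
        tokens = tokens.clone()
        tokens[keep] = pred[keep]
    return tokens

# Dummy "model" (random logits over a 512-token vocab) to show the control flow:
out = iterative_decode(lambda toks: torch.randn(toks.shape[0], 512), 256)
print((out == -1).sum().item())  # 0 -> every token decoded after the last step
```

Because each step predicts all tokens in parallel, the number of forward passes equals the step count rather than the token count, which is where the large speedup over token-by-token autoregressive generation comes from.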
2/ It was a bit better in the early days when you had fewer people and the companies were still trying to figure out the right structure of their research labs.
When we were working on modelling human behavior for social navigation in early 2017, we were not satisfied with the available datasets. Check out the new dataset from
@StanfordSVL
Of course it would not have been possible without an amazing group of people who shared the vision: Jia Deng
@RichardSocher
@lijiali_vision
Kai Li, and made it possible.
Every day we have a new LLM paper, but web search still sucks! Having trouble answering simple questions in 2022. Interestingly, the 4th link does have the correct age highlighted 🤷‍♂️. Also checked
@YouSearchEngine
same failure mode :/
Hopefully LLMs disrupt search soon!
3/ Our pre-trained policy zero-shot generalizes to 1000s of variations in dynamics and kinematic parameters and even completely unseen morphologies. The graph below shows zero-shot performance 👇
Any time I feel smart, I remember that I am hesitant to watch new movies because they take too much time yet I gleefully chain-watch videos on YouTube.
Happy to announce DreamFusion, our new method for Text-to-3D!
We optimize a NeRF from scratch using a pretrained text-to-image diffusion model. No 3D data needed! (a toy sketch of the core update follows below)
Joint work w/ the incredible team of
@BenMildenhall
@ajayj_
@jon_barron
#dreamfusion
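A toy sketch of score distillation sampling (SDS), the update at the heart of DreamFusion. To keep it self-contained, the "renderer" is just a learnable image and the frozen "diffusion model" is an untrained CNN, so this shows only the gradient flow, not real text-to-3D:

```python
import torch
import torch.nn as nn

image = nn.Parameter(torch.randn(1, 3, 32, 32))      # stand-in for a NeRF render
denoiser = nn.Conv2d(3, 3, 3, padding=1).requires_grad_(False)  # frozen "diffusion model"
opt = torch.optim.Adam([image], lr=1e-2)

for step in range(100):
    t = torch.rand(())                               # random noise level in [0, 1]
    eps = torch.randn_like(image)
    noisy = (1 - t) * image + t * eps                # simplistic forward process
    with torch.no_grad():
        eps_pred = denoiser(noisy)                   # frozen score estimate
    # SDS: the (eps_pred - eps) "gradient" flows only through the render.
    loss = (image * (eps_pred - eps)).sum()          # d loss / d image == eps_pred - eps
    opt.zero_grad()
    loss.backward()
    opt.step()
```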
Congratulations
@DeepMind
@demishassabis
. In 2014, people thought DeepMind's mission statement was absolutely crazy: "solving intelligence" and using it to solve other challenges. It's amazing what we can achieve with current capabilities. The future looks promising!
The
#AlphaFold
2 papers on the methods and human proteome predictions are out today in hard copy in
@Nature
! A really proud moment to see our work featured with a fantastic image on the front cover of the issue:
Code and pretrained models for our
#CVPR2018
paper on generating images from scene graphs is now available! A step toward creating images with fine-grained control over visual content. With
@agrimgupta92
and
@drfeifei
2/2 LVIS v1.0 will be used for the Joint COCO and LVIS Workshop in ECCV 2020. Please see the section about best practices if you use LVIS in your research.
@maxjaderberg
Congratulations on the release! Both the environment and the learnt behaviors are fascinating. Really like the procedural generation of environments which encompasses both cooperative and competitive games!
"Biology is far too complex and messy to ever be encapsulated as a simple set of neat mathematical equations. But just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI"
Thrilled to announce the launch of a new Alphabet company
@IsomorphicLabs
. Our mission is to reimagine the drug discovery process from first principles with an AI-first approach, to accelerate biomedical breakthroughs and find cures for diseases. Details:
Feature request for prompt: interesting follow-up ideas on X. The model should do more than extract future-work snippets. Paper as context + a huge database of papers --> a really helpful brainstorming partner!
🪐 Introducing Galactica. A large language model for science.
Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.
Explore and get weights:
.@StanfordHAI
researchers created a computer-simulated playground where arthropod-like agents dubbed "unimals" (short for universal animals) learn and are subjected to mutations and natural selection.