Director, Max Planck Institute for Intelligent Systems (@MPI_IS). Chief Scientist @meshcapade. Building 3D digital humans using vision, graphics, and learning.
I get a lot of reviews that say my work is not novel and I bet I'm not alone. It's always frustrating because I see novelty where the reviewer doesn't. Rather than rebut every critique, I've written a blog post to help reviewers think about novelty.
PhD students, don't worry. Technologies, trends, and even whole fields come and go. A PhD makes you an expert in a field but, more importantly, teaches you how to become an expert. Once you know that you can learn anything, you can adapt to major disruptions in your field.
I asked #Galactica about some things I know about and I'm troubled. In all cases, it was wrong or biased but sounded right and authoritative. I think it's dangerous. Here are a few of my experiments and my analysis of my concerns. (1/9)
In the LLM-science discussion, I see a common misconception that science is a thing you do and that writing about it is separate and can be automated. I’ve written over 300 scientific papers and can assure you that science writing can’t be separated from science doing. Why? 1/18
The Max Planck Society has pledged to support Ukrainian scientists who have to flee and need a place to work. If you are a computer vision scientist leaving #Ukraine, reach out to me.
arXiv can result in a time travel situation where you find out your daughter is also your mother. We put a paper on arXiv. Someone else built on it, cited us, and published before us. Fine. Now reviewers want us to explain our novelty beyond the published paper. Now what?
Failure is the dance partner of success. It can feel insurmountable at times. Here is my story of academic failures -- my anti-CV if you will. I hope people find it useful.
The 5 stages of rebuttal grief.
(1) Denial
The reviewers totally misunderstood my paper. The review process is broken. R1 was clearly a student who has never reviewed before. R2 doesn’t know what they are talking about. R3 hates me.
WHAM defines the new state of the art in 3D human pose estimation from video. By a large margin. It’s fast, accurate, and it computes human pose in world coordinates. It’s also the first video-based method to be more accurate than single-image methods. 1/8
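The world-coordinate output is the key distinction: most methods leave the body in camera coordinates. The mapping itself is a rigid transform, sketched below with made-up numbers (the rotation and translation here are illustrative stand-ins, not WHAM's actual estimates, which come from learned camera/body trajectories):

```python
import numpy as np

def camera_to_world(joints_cam, R_wc, t_wc):
    """Map 3D joints from camera coordinates to world coordinates.

    joints_cam: (N, 3) joint positions in the camera frame.
    R_wc:       (3, 3) rotation from camera frame to world frame.
    t_wc:       (3,)   camera position in world coordinates.
    """
    return joints_cam @ R_wc.T + t_wc

# Toy example: camera rotated 90 degrees about y, positioned 2 m up.
theta = np.pi / 2
R_wc = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                 [0.0,           1.0, 0.0],
                 [-np.sin(theta), 0.0, np.cos(theta)]])
t_wc = np.array([0.0, 2.0, 0.0])

joints_cam = np.array([[0.0, 0.0, 1.0]])  # one joint, 1 m in front of the camera
joints_world = camera_to_world(joints_cam, R_wc, t_wc)
```

The hard part, of course, is estimating `R_wc` and `t_wc` per frame from video; the transform itself is the easy bit.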
I think the rule that you do not need to cite arXiv papers in CVPR/ICCV/ECCV submissions confuses people. If you build on prior work, then you must cite it. If that work appeared in arXiv or was painted on the sidewalk, it doesn't matter.
So I got this nice award (PAMI Distinguished Researcher). PAMI is the name of a journal and the technical committee that helps run our field. But for me it’s more. It’s a home that’s supported and nurtured my career. A scientist doesn’t make a career on their own. 1/3
If you believe that social media can replace peer review, consider my experience. Every paper I've published has been improved by peer review. I can't say the same for comments on Twitter. I may not always agree with reviewers but they spend hours with my paper, not seconds.
It offers authoritative-sounding science that isn't grounded in the scientific method. It produces pseudo-science based on statistical properties of science *writing*. Grammatical science writing is not the same as doing science. But it will be hard to distinguish. (6/9)
Why dangerous? Galactica generates text that's grammatical and feels real. This text will slip into real scientific submissions. It will be realistic but wrong or biased. It will be hard to detect. It will influence how people think. (5/9)
I applaud the ambition of this project but caution everyone about the hype surrounding it. This is not a great accelerator for science or even a helpful tool for science writing. It is potentially distorting and dangerous for science. (9/9)
I've loved #ICCV since my first one in 1990. In this blog post, I reflect on the last 31 years of ICCV and the field of computer vision. I hope you enjoy this on the last day of #ICCV2021. See you in Paris in 2023!
Summarizing my CVPR reviews: reviewers prefer a poor solution to a new problem over a really good solution to an existing problem. A field needs both. If every paper proposes a new problem, we will never make progress on any of them.
With LLMs for science out there (#Galactica) we need new ethics rules for scientific publication. Existing rules regarding plagiarism, fraud, and authorship need to be rethought for LLMs to safeguard public trust in science. Long thread about trust, peer review, & LLMs. (1/23)
Stepping outside my area of expertise, I’m frustrated that Europeans are not being asked to reduce their energy use. In addition to showing solidarity with #Ukraine, it would enable the EU to impose stronger sanctions against Russia. 1/4
HAAR is a generative model of 3D hair. Given a text description, HAAR generates an encoding of the 3D hair strands. These are decoded into an animation-ready hair model. This connects text-driven generative models with standard graphics that can be used today, eg in Unreal. 1/5
Multi-modal #LLMs understand a lot about humans. But do they understand our 3D pose? We train #PoseGPT to estimate, generate, and reason about 3D human pose (#SMPL) in images and text. This is the first true foundation model for understanding 3D humans.
Who should be the last/senior author on a paper? How do you decide? What does being last entail? I get these questions a lot and it’s confusing because the last author is often a senior person, running a group & raising money. Do those things determine last authorship? No. (1/7)
This could usher in an era of deep scientific fakes. Alldieck and Pumarola will get citations to papers they didn't write. These papers will then be cited by others in real papers. What a mess this will be. (7/9)
I find myself repeating the same advice to students again & again. So I finetuned Mistral.APR.01 on the last 4 years of my Slack history. Now it's answering questions for me. I can go biking for a week and nobody notices. I think my students may actually be getting better advice.
I’m honored to share the 2020 Longuet-Higgins prize with @DeqingSun and Stefan Roth. It is given at #CVPR2020 for work from #CVPR 2010 that has withstood the test of time. I’ve written a blog post about the secrets behind “The Secrets of Optical Flow”:
Understanding human behavior requires understanding 3D human contact with the world. To study this, we introduce a dataset (RICH) and a method (BSTRO) that infers 3D contact on the body from a single image. #CVPR2022 (1/8)
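To make the notion of per-vertex body-scene contact concrete, here is a toy heuristic: label a vertex "in contact" when it lies within a small threshold of the ground plane. This is purely illustrative; BSTRO instead *learns* per-vertex contact from image evidence alone, without access to the 3D scene:

```python
def contact_labels(vertices, ground_height=0.0, threshold=0.02):
    """Toy contact labeling: a vertex (x, y, z) is 'in contact'
    if its height y is within `threshold` meters of the ground
    plane y = ground_height. Illustrative only."""
    return [abs(y - ground_height) < threshold for (x, y, z) in vertices]

verts = [(0.0, 0.010, 0.0),   # heel, touching the ground
         (0.0, 0.950, 0.1),   # hip, well above it
         (0.2, 0.015, 0.3)]   # toe, touching
labels = contact_labels(verts)  # [True, False, True]
```

Real scenes have non-planar geometry and contact with objects, which is exactly why a learned predictor is needed.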
Train your avatars to interact with 3D scenes. We use adversarial imitation learning and reinforcement learning to train physically-simulated characters that perform scene interaction tasks in a natural and life-like manner. Today at #SIGGRAPH2023.
Anyone who has taught knows the following is true. You think you understand something until you go to teach it. Explaining something to others reveals gaps in your understanding that you didn’t know you had. Well, writing a scientific paper is a form of teaching. 2/18
This is why I find AR more exciting than VR. It’s the ability to extend human perception beyond the visible spectrum that will literally let us understand the world in new ways.
Today marks my 10th anniversary of living in Germany. It was my honor to co-found the @MPI_IS together with @bschoelkopf and to build the @PerceivingSys department together with amazing students, postdocs, and staff, past and present. My deepest thanks.
The #CVPR2024 deadline is approaching and it's time to “Dance with the one who brung ya” - this is the phrase I use before a deadline when your results fall short of your dream (as they often do). It means to accept what you have and make the most of it. 1/4
Nature: 10 reasons to move to Germany as a researcher. All true! They miss some other reasons: 1. vacation and a belief that it is important to take it (even PhD students). 2. Higher education and science are valued by society. Being a Prof. Dr. is cool.
A critical skill for scientists is to know what you know and to know what you don't know. And then admit what you don't know to yourself and others. LLMs like #ChatGPT are hugely impressive but to make them useful for science, the ability to say "I don't know" is necessary.
If we are going to have fake scientific papers, we might as well have fake reviews of fake papers. And then we can also have fake letters of reference for fake academics who get promoted to tenure at fake universities. I can then retire as there is nothing left for me to do.
What’s the key enabling technology of the #metaverse? It’s not #VR headsets, #AR glasses, or #avatars with legs. It’s computer vision (CV) and @Meta clearly understands this. We focus on headsets & avatars because they're tangible, visible artifacts in a way that CV isn’t. 🧵
Summary: science thinking, writing, and doing are inseparable. Focus on story. Write early. Write a shitty first draft. And do yourself a favor: write it yourself. I promise that writing about your science will improve your science. 18/18
It is not the "smartest" people who succeed in science. It takes intrinsic motivation, perseverance, the drive to stay active and hands on, a strong support network of mentors, and the curiosity to keep learning. Nice summary in Nature:
I hear a lot about how "speed" is critical to scientific progress. We are obsessed with speed these days. I think accuracy and correctness are what really matter. In my experience, correctness and speed are often inversely correlated. So don't be fooled by speed demons.
Given the outside surface of the human body, can we peer inside and infer the bones? Many methods predict a “skeleton” that is not realistic. With OSSO (#CVPR2022), for the first time, we learn to predict a detailed skeleton from external observations. (1/7)
Upgrade your expressive 3D human avatars from #SMPL-X to #SUPR, our latest and greatest body model. SUPR is trained from 1.2M 3D scans, is more expressive, and includes feet with articulation and compression. Code by @NeelayShah8, video by @AYiannakidis.
As an advisor, there is nothing better than seeing your students and post docs succeed, grow, and become part of the community. This group is so impressive. I love how they support each other and I love their intellectual curiosity. I’m only sad for the ones who couldn’t come.
This was my first CVPR and nobody told us we were getting an award. Anandan and I skipped the banquet where the prize was awarded. David Kriegman accepted it on our behalf and for the rest of the conference people were congratulating him. Lesson: never skip the awards session!
I'm sure the authors are aware of the dangers. Every generation comes with the fine print "WARNING: Outputs may be unreliable! Language Models are prone to hallucinate text." But Pandora's box is open and we won't be able to stuff the text back in. (8/9)
I always assumed language is harder than vision since it evolved later. Even simple species have vision that allows survival. Thus, I thought we’d solve the “vision problem” before higher-level reasoning. That language is helping us solve the vision problem has been a surprise.
I was honored to accept the Koenderink Prize at @eccvconf 2022 on behalf of my coauthors Dan Butler, Jonas Wulff, and Garrett Stanley. The prize recognizes the Sintel optical flow dataset paper for standing the test of time. Behind-the-scenes blog post:
This is an interesting, timely, and important paper. The takeaway is that "recent self-supervised models such as DINOv2 learn representations that encode depth and surface normals, with StableDiffusion being a close second". This contrasts with vision-language models like CLIP,…
Google announces "Probing the 3D Awareness of Visual Foundation Models": Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their…
I repeat: Easily produced science text that's wrong does not advance science, improve science productivity, or make science more accessible. I like research on LLMs but the blind belief in their goodness does a disservice to them and science. Here is an example from #ChatGPT. 1/5
Computer vision for animals is a growing sub-field that will have a big impact. We have organized a CVPR workshop that brings together amazing keynote speakers. All we need now is you! Please submit your work even if it is preliminary.
.@thiemoall publishes in the area (excellent work BTW) so it's on the right track but it has made up this reference. Based on these few tests, I think #Galactica is 1) an interesting research project, 2) not useful for doing science (stick with Wikipedia), 3) dangerous. (4/9)
While most existing methods that estimate 3D human pose and shape (HPS) do so in camera coordinates, many applications require 3D humans in global coordinates. At #CVPR2023, we introduce TRACE, which addresses this using a novel 5D representation. 1/6
ARCTIC is a multi-view dataset at #CVPR2023 containing ground truth 3D humans interacting with articulated objects. It includes ground truth #SMPL-X bodies, #MANO hands, calibrated RGB images, ego-centric video, and articulated 3D object shapes. (1/7)
I think about the field of 3D human pose, shape, and motion estimation as having three phases. 1: Optimization. 2: Regression. 3: Reasoning. With #PoseGPT, we are just entering phase 3. I summarize the coming paradigm shift in this blog post:
Today is my last day at #Amazon. I joined over 4 years ago as a Distinguished Amazon Scholar (20% time) through the acquisition of Body Labs. It has been an amazing experience, I’ve learned a lot, and will always be grateful for the opportunity. (1/8)
These are exciting times. There's a sense that AI will change everything, including how science is done. Implicit in this excitement is the hope that everything will change for the better. Let’s look at that. First, we need to define “better.”
The German Chancellor, Angela Merkel, visited @MPI_IS today to hear about @Cyber_Valley - virtually of course. Here she is getting a tour of the @PerceivingSys capture hall. It was fun, she asked great questions, and the team was awesome!
Only in #SanFrancisco does this ad make sense on a billboard at the train station. A key rule of advertising is to know your customer. Nailed it. I'm thinking, "yes, indeed, I do want to save $20M on my next H100 bill".
@jbhuang0604 All good points but let me add one more. Use the work yourself. Build on it. In this way you teach others how it is useful. If it isn’t foundational for you, why would it be for others?
Is it my imagination or are the ads longer than the talks at #ECCV2020? It changes the character of the scientific meeting. I love an expo, but let the science be ad-free. Let's call this a worthwhile experiment that failed and let's not do it again.
BEV (#CVPR2022) computes all the 3D people in an image in one shot, placing them all appropriately in depth. The key novelty is an imaginary 2D “Bird’s-Eye-View” (BEV) representation that reasons about the body centers in depth. (1/8)
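The flavor of a bird's-eye-view representation can be conveyed with a toy version: drop each person's 3D body center into a top-down grid indexed by lateral position and depth. This is purely illustrative (BEV learns its representation end to end; the ranges and grid size here are invented):

```python
def birds_eye_bins(centers, x_range=(-5.0, 5.0), z_range=(0.0, 10.0), n=10):
    """Toy top-down binning: place 3D body centers (x, y, z) into an
    n x n grid over lateral position x and depth z, ignoring height y."""
    grid = [[0] * n for _ in range(n)]
    for (x, y, z) in centers:
        i = min(n - 1, max(0, int((z - z_range[0]) / (z_range[1] - z_range[0]) * n)))
        j = min(n - 1, max(0, int((x - x_range[0]) / (x_range[1] - x_range[0]) * n)))
        grid[i][j] += 1  # row = depth bin, column = lateral bin
    return grid

# Two people side by side at 2.5 m depth, one farther back at 8 m.
centers = [(-1.0, 0.9, 2.5), (1.2, 0.9, 2.5), (0.0, 0.9, 8.0)]
grid = birds_eye_bins(centers)
```

A grid like this makes relative depth ordering explicit, which is exactly what reasoning about multiple people in one shot needs.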
I wanted to build a 4D body scanner at Brown and wrote an NSF proposal with a colleague to fund it. We got high scores for everything but the final analysis was that they didn't think anyone needed 4D body scans. That gave me the kick to move to MPI where I could pursue my dream.
Our highlight in 2013 was the @PerceivingSys new state-of-the-art 4D body scanner. @Michael_J_Black & his team set up the first version that year. Since then, many prominent people have stood in it. Only recently, we captured the motion of the Vice-President of the EU Commission
Motion: All CVF conferences will have a dataset review process for papers promising a dataset. Acceptance will be conditional on the dataset being ready and reviewed by the camera-ready deadline. Papers that do not comply will be rejected. Comments please:
I entered "Estimating realistic 3D human avatars in clothing from a single image or video". In this case, it made up a fictitious paper and associated GitHub repo. The author is a real person (@AlbertPumarola) but the reference is bogus. (2/9)
Is synthetic data “all you need” to train 3D human pose and shape (HPS) regressors? Is the field making progress? What algorithmic decisions matter? To address these questions, we present BEDLAM (#CVPR2023 highlight paper), a synthetic dataset of 3D humans. 1/11
GenAI systems create realistic images or basic 3D shapes from text. We take a different approach and use AI to control classical 3D graphics models, turning AI into a computer-graphics artist. We demonstrate this approach by generating novel 3D trees and animals. 1/8
@OHilliges My wife has had #MECFS since the 1990s. One doctor told her "You're tired? We're all tired. Go home and have babies." On the other hand, the head of rheumatology at Stanford told her "Don't let anyone tell you that you're not sick." More research is desperately needed.
Avatars are central to the success of the #metaverse and #metacommerce. We need different #avatars for different purposes: accurate #3D digital doubles for shopping, realistic-looking for #telepresence, stylized for fun, all with faces & hands. @meshcapade makes this easy. (1/8)
The writing reveals what you don’t know. Years ago, Michal Irani gave me good advice. She said you can write the introduction to your paper long before the science is done and that this helps structure your thinking. 4/18
BITE (#CVPR2023) reconstructs 3D dog shape and pose from one image, even with challenging poses like sitting and lying. Such poses result in occlusion and deformation. Key idea: leverage ground contact to better estimate pose and shape. 1/6
Given the interest in my thread about managing research in academia, I thought I would share my thoughts about managing research in industry. What I learned as a manager at Xerox PARC, is that managing research takes guts — this is probably true about managing anything creative.
Realistic 3D human animation is hard. Goal: automate it using only speech. Given a speech signal as input, TalkSHOW (#CVPR2023) generates realistic, coherent, and diverse holistic 3D motions, that is, the body motion together with facial expressions and hand gestures. (1/8)
Yesterday was my birthday and the folks at @meshcapade made me this wonderful movie. I love fonts and this is the best font ever! I'm going to call it "Avatar". It's #SMPL to make me happy.
We've seen rapid progress on generating human motion from text descriptions. But to be really useful, animators need timeline control. With our new work, one can control when multiple actions occur and these actions can even overlap. Great #Nvidia internship by @MathisPetrovich.
Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation. Paper page: "Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from…"
Social media, influencers, and science publication: I wrote up my thoughts in this blog post. My take on this may surprise people who know my public opposition to promoting papers that are under review.
IPMAN (#CVPR2023) uses intuitive physics to estimate physically-plausible 3D human bodies from images. Existing methods produce 3D humans that “align” well with the image in the camera view but are often physically implausible, leaning, hovering, or penetrating the ground. 1/9
Your paper is teaching your reader about your hypothesis, problem, method, the prior work in the field, your results, and what it all means for future work. When you write up your work and find it challenging, this is typically because you don’t yet fully understand it. 3/18
It's very tempting to write "To the best of our knowledge, we are the first to do X, Y, and Z." But, please Google "X Y Z" first as you may be surprised. Writing it won't make you "first". It will only reveal your lack of "knowledge".
The SUPR model is our latest generative 3D human. SUPR is learned from 1.2M 3D scans, more than any other 3D human. This makes it highly accurate and realistic. Unlike previous models, SUPR includes a realistic foot that deforms with contact. (1/8)
Accurate body shape from a single image? To make this easy, SHAPY (#CVPR2022) regresses body shape from an image of a person in normal clothing and any pose. The trick? We use linguistic body shape attributes to learn metrically-accurate shape. (1/10)
I look a little unhinged in this photo. But despite that, the German National Academy of Sciences @Leopoldina decided to make me a member. I'm very grateful because Germany has been good to me and being part of the Academy gives me a chance to give back.
A great thread by @ericarbailey got me thinking about managing innovation in industry and academia. I tried bringing processes like scrum to research but it didn't work. Breakthrough ideas require long periods of "wandering in the woods", with no apparent measurable progress.
For #CVPR2023, we have a nice little magic trick. MIME takes 3D human motion capture and generates plausible 3D scenes that are consistent with the motion. Why? Most mocap sessions capture the person but not the scene.
How to create a 3D #avatar of yourself: use @LumaLabsAI's #NeRF tech to get a 3D scan, then use @meshcapade's avatar platform to automatically rig and animate it. Our CTO with some nice dance moves!
I have a modest proposal for #science #publication in the #ChatGPT era. We publish code and data to enable reproducibility. If you use ChatGPT to do your science, then include the prompts used in the appendix. Maybe one day people will get tenure and prizes for clever prompts.
To honor 50 years of #SIGGRAPH, a committee of graphics leaders assembled a collection of "Seminal Graphics Papers" published by #ACM. I'm delighted that the #SMPL paper is among them. It's the 6th most cited #ToG paper. Thanks to all the users of SMPL!
Then I tried "Accurate estimation of body shape under clothing from an image". It produces an abstract that is plausible but refers to Alldieck et al., "Accurate Estimation of Body Shape Under Clothing from a Single Image", which does not exist. (3/9)
VIBE is the current SOTA method for extracting 3D human meshes from video. The training code is now on-line as is an updated arXiv paper with supplemental details. Nice work @mkocab_ @athn_nik!
Emotion is central to human communication. The #metaverse won’t be a place anyone wants to be if our #avatars don’t accurately convey our emotions. EMOCA (#CVPR2022) computes animatable 3D human faces from images that capture rich emotional detail. (1/7)
Text in. Realistic animation out. Tip of the iceberg. More to reveal at #GDC2024. This will be my first time at #GDC. See you at the @meshcapade booth.
Since I proposed the current PAMI-TC media ban and the PR ban that preceded it, I should explain why I think we need them. First, we need to establish our goal. Clearly, we all want to have our papers accepted. This is our personal goal. But our goal, as a community, is wider. 1/
I had my first website in 1995. The hardest part back then was getting an image of myself into the computer. I didn't figure that out until 1996 when I used a Silicon Graphics O2 workstation to digitize a grainy videotape of myself.