Ask me anything -- July 2024
Conversations on X about vision and careers
16 July 2024 · Michael J. Black · 14 minute read
3DV is a conference focusing on 3D vision that uses social media creatively. The 3DV2025 organizers asked me to "Answer Anything" on X for 24 hours and I received many excellent questions. I've compiled the questions and my answers here since they may be interesting to a wider audience. Thank you to 3DV for the opportunity and thank you to the community for all the great questions. I've grouped them into "Science" and "Career" questions.
Science
Hi Michael, do you believe the understanding of 3D properties can eventually (and automatically) emerge from video understanding, e.g., video diffusion models?
Yes. In fact, I believe LLMs, trained only on language, have an understanding of the 3D world.
People will argue "but this is not how children learn about the world". That's true but irrelevant.
Humans are also incapable of reading everything ever written over all of human history or watching every video ever recorded.
It's hard for us to imagine what you can learn about the world by reading and watching "everything". I would argue that an AI can learn a lot this way.
Hi Michael. What is the most attractive part of creating 3D digital avatars compared to creating 2D avatars?
It is important not to confuse "representation" with "rendering".
Literal 2D avatars would be like cardboard puppets. We live in 3D and our representations of humans need to be 3D.
Representation: Your question is really about whether the 3D representation is explicit (e.g. a mesh like SMPL) or implicit (e.g. weights in a neural network like Sora).
Here, we can weigh the pros and cons of each representation for specific tasks. If you need full control over 3D then explicit meshes are currently a better way to go. But this could change as we introduce better "3D labels" into implicit representations.
Rendering: In the end we always produce pixels. This is where the confusion about 3D and 2D comes in.
We render 2D but the renderer differs depending on the representation. With an implicit model like Sora, the rendering is done by the network.
With an explicit model like SMPL, you can render using traditional graphics, neural graphics like ControlNet, Gaussian splatting, etc.
Once you separate representation from rendering, you have a much clearer design space within which to analyze your problem and choose the right approach.
Do you think that modelling the world via video generation (2D/3D) with enough scale would enable learning physics as an emergent property without needing additional inductive biases? At least to the extent that leads to agents capable of functioning effectively in the real world.
It depends what you mean by emergent. I think such systems already learn a lot about the physics of the world but this is *implicit* -- ie emergent but not explicit. Current video generation models can't, for example, tell you what forces are being applied.
To function in the real world, this implicit knowledge might need to be more explicit. To make the implicit explicit requires some sort of labels. This could be done through physics simulation, for example. Right now, such models are trained with text labels and not physics labels. I suspect a small amount of physical labelling will enable the model to relate its implicit knowledge to physical concepts.
That said, people are training robots to do tasks without this explicit physical relationship. I think this is possible because much of what we do in the world is not "strongly governed" by physics. Our bodies, for example, are plenty strong enough to deal with gravity for most activities. We are not physics-limited in our daily life. But when we do extreme activities like gymnastics, lift heavy objects, etc., then our movements are physics limited.
Question: where do you see this research field in the next 5 and then 10 years?
Are you asking specifically about digital humans or 3D vision more generally?
yes specifically asking for digital humans!
In 5 years, we will have fully realistic virtual humans that can interact with 3D scenes (virtual or captured in AR), have goals, can plan their actions to solve tasks, move realistically, have personalities, talk with you, and are basically indistinguishable from real people.
In 10 years, we will have entirely new industries built around these capabilities. In particular gaming and movies/TV will be radically changed with entirely new forms of entertainment appearing. If you're feeling retro, you might even call it the metaverse ;)
Second question, what's your favourite movie and video game?
Favorite movie of all time: The Princess Bride.
Favorite video game: Zwift (ok, this is probably a funny answer but I'm a big biker and Zwift gamifies bicycling).
What are some interesting applications/problems at the intersection of ML and Physics?
I have a love-hate relationship with physics.
Imagine trying to simulate a 3D human interacting with objects in the world. Imagine that they drop an object. What happens? Do I really want to get training data of objects being dropped and learn this from examples? No.
Physics gives us the ability to represent properties of the world "for free". So why am I ambivalent about it?
In the real world, physics is "perfect" and everything works out exactly as it should. But in simulation, physics engines are basically a "hack". Imagine a humanoid character trained using physics simulation and RL. What do the feet look like? They look like bricks. You can't produce natural human gait if your feet are bricks.
So to use physics to learn human motion, you need a more realistic foot. So you add all the bones to capture the articulation. But what about the soft tissue deformation? What about modeling the friction coefficients? What if the foot is sweaty? What if you put it in a shoe? Do you need to model the interaction between the foot and the shoe?
To get it "right", you effectively have an infinite regress. Physics is a demanding task master. At some point, you have to cut it off and make an approximation (ie a hack). Where do you do that?
So combining physics and ML seems promising but the devil is in the details.
Hi Michael, what do you think about robust losses in the modern era? Do they have a place in training models, if so, where/when?
Robust losses are used everywhere! L1 is robust. The influence of outliers is constant -- that is, the derivative of L1 is constant so outliers do not have the same impact as with L2 (where the influence increases linearly).
Choosing a robust loss function comes down to several considerations: (1) is it differentiable? L1 used to be considered non-differentiable but we ignore this today. (2) How robust do you need to be to outliers -- that is, do you need a "re-descending" influence function that falls back toward zero for large residuals?
For many problems in computer vision, outliers are already bounded. If you are reconstructing images, for example, your pixel values are in a finite range (0-255). So you can't get gross outliers and don't need more robust functions than L1 (in general).
There are lots of other things to consider but, bottom line, we use robust losses all the time.
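To make "influence" concrete, here is a minimal numpy sketch (my illustration, not part of the original answer) comparing the derivatives of L2, a smooth L1 (the Charbonnier approximation), and the re-descending Geman-McClure loss; the scale parameters are arbitrary choices for illustration:

```python
import numpy as np

# Residuals ranging from well-fit inliers to gross outliers.
r = np.linspace(-10.0, 10.0, 401)

# L2: the influence (derivative) grows linearly, so outliers dominate.
dL2 = 2.0 * r

# Charbonnier, a smooth approximation of L1: the influence saturates,
# so every outlier has roughly the same bounded impact.
eps = 1e-3
dL1 = r / np.sqrt(r**2 + eps**2)

# Geman-McClure, a "re-descending" loss: the influence falls back toward
# zero for large residuals, so gross outliers are effectively ignored.
sigma = 1.0
dGM = 2.0 * r * sigma**2 / (r**2 + sigma**2) ** 2
```

Plot these three curves and you see the whole story: L2 lets a single gross outlier dominate the gradient, L1 caps its impact, and a re-descending loss eventually ignores it.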
What’s the most striking divergence you’ve seen between a CV algorithm and human perception, where the algorithm got the “right” answer?
In computer vision we solve lots of problems that humans can't. For example, we focus on metric reconstruction -- e.g. accurate estimation of depth, surface normals, motion, etc. CV algorithms can do this much better than humans. Or problems like denoising are not really human problems. Humans may be robust to noise but you can stare at a noisy picture all day and it will still be noisy.
The recent interest in generative image and video models is really different from humans. Most of us are unable to generate realistic images -- only artists can do this. We may imagine images in our head but we can't project them out of our eyes!
Artificial intelligence (or artificial vision) is just that -- artificial. There is no reason to assume or even want it to be like human vision. We don't want cars with human vision, we want them with super-human abilities.
Hi Michael, which essential papers would you recommend for a fresh grad student to read to get up to speed with the latest developments in the 3D vision field?
This is the hardest question so far.
For human motion and shape representation and estimation, this historical talk provides an extensive bibliography at the end: https://files.is.tue.mpg.de/black/talks/CVPR2024workshopHistory_web.pdf
But more generally, a student in this area should understand the fundamental 3D representations: point clouds, meshes, voxels, and implicit surfaces (at a minimum). Without a solid foundation of existing representations, a student will be grasping in the dark and may not see opportunities to innovate.
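As a concrete illustration (mine, not from the original answer), here is a minimal numpy sketch of a simple shape under each of these representations:

```python
import numpy as np

# Explicit mesh: the surface is stored directly as vertices plus faces.
vertices = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], float)
faces = np.array([[0, 1, 2], [0, 3, 1], [0, 2, 3], [1, 3, 2]])

# Point cloud: the vertices alone, with no connectivity.
points = vertices.copy()

# Implicit surface: a function whose zero level set is the surface.
def sphere_sdf(p, radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(p, axis=-1) - radius

# Voxels: sample occupancy (here, the sign of the SDF) on a regular grid.
axis = np.linspace(-1.5, 1.5, 32)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
occupancy = sphere_sdf(grid) <= 0.0  # (32, 32, 32) boolean grid
```

Each form trades off differently: meshes carry topology and render efficiently, point clouds are easy to capture, voxels are easy to convolve over, and implicit functions give resolution-independent surfaces.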
Any list I provide will be very personal and necessarily incomplete. The list in the link above is pretty comprehensive but is missing papers on animation. In particular, I should add the original "subspace deformation" paper that really defines linear blend skinning and its problems. http://scribblethink.org/Work/PSD/PSD.pdf
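For readers new to the area, here is a minimal numpy sketch of the classic linear blend skinning equation that the paper analyzes (my own illustration; the function and argument names are hypothetical):

```python
import numpy as np

def linear_blend_skinning(verts, weights, bone_transforms):
    """Deform a rest-pose mesh by blending per-bone rigid transforms.

    verts: (V, 3) rest-pose vertices
    weights: (V, K) skinning weights; each row sums to 1
    bone_transforms: (K, 4, 4) world transforms of the K bones
    """
    V = verts.shape[0]
    verts_h = np.concatenate([verts, np.ones((V, 1))], axis=1)     # (V, 4)
    # Apply every bone transform to every vertex...
    per_bone = np.einsum("kij,vj->vki", bone_transforms, verts_h)  # (V, K, 4)
    # ...then blend the results with the skinning weights.
    blended = np.einsum("vk,vki->vi", weights, per_bone)           # (V, 4)
    return blended[:, :3]
```

Because LBS blends rigid transforms linearly in matrix space, joints collapse under large twists (the "candy-wrapper" artifact), which is the problem that pose space deformation, and later learned models like SMPL, set out to fix.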
For essential background, I would recommend the classic graphics textbook: "Computer Graphics: Principles and Practice." It's a great reference text.
Career
Hi Michael, what do you think about the significance of publication quantity and impact when it comes to job hunting and selecting a research direction?
Quantity is nearly irrelevant. In fact, if I see a student with a huge number of papers, it puts me off. Why are they on so many papers? Are they some sort of political operator or are they a scientist?
Impact is hard to measure in the short term. So, for a graduating student, citation counts are pretty unreliable as a metric. If I see that a student has one good paper every year in a top conference, I'm paying attention. Then what matters is the quality of the ideas in those papers, the attention to detail, the quality of the execution, etc.
In academic hiring, the applicant's research vision is key. I also want to know about their character -- are they honorable and will they be a good colleague? Will they be a good advisor to students?
Hi Michael, should a new professor explore new directions in other fields or focus on refining their current expertise during the first couple of years? Also, what's the difference between maintaining a group's research focus and selecting research directions as a PhD student?
This is somewhat personal and will vary between individuals. So I'll just tell you my story.
My PhD advisor worked on optical flow estimation and that's what I did for my PhD. I continued this for my postdoc.
When I started my independent research career at Xerox PARC, a senior person asked me if I was going to continue doing what my advisor did. I inferred from this that they thought that was a bad idea. It got me thinking about the bigger picture and what I wanted to achieve.
I was fascinated by the perception of motion and I thought: the most interesting motions are human motions. So I began by applying what I knew about optical flow to the problem of human motion estimation.
The first paper was a collaboration with Yaser Yacoob on tracking and recognizing facial expressions (ICCV 1995, video). I then worked on articulated motion (Cardboard People, Face and Gesture 1996, pdf). This led me in all kinds of new directions.
So the answer is complex -- as an academic, you want to establish yourself as "the world's expert" on something. This means doubling down on your strengths. But it also means finding your unique niche, which is likely a bit different from your thesis advisor's niche.
Hi Michael, what’s the worst part of being a professor and what’s the best part?
There are lots of best parts :)
But the best best part is the freedom to choose your research problem. A professor doesn't have a boss. If you're driven by your own curiosity, there is no better profession.
Another best part is following the careers and lives of your former students.
Worst? Reviewer 2? Or having to raise funding to do what you want. If you have a crazy idea, you still have to convince grant reviewers and paper reviewers that it's actually a good idea. I found this frustrating.
At Brown I applied with a colleague for NSF funding to build the world's first 4D body scanner. The grant was highly scored but rejected because they thought the world didn't need a 4D body scanner.
So, when I got the opportunity to move to Max Planck, it came with funding for the rest of my career. No need to ask anyone. So I immediately built my 4D body scanner. It turned out to have many uses and now many companies have them.
Hi Michael, you have also founded a few companies in a similar research space. What are your thoughts on the pros and cons of "freedom" when comparing professors vs. startup founders?
Good question. I often say that being a professor is like being an entrepreneur. You have to sell your vision, build a team, raise the funding, pivot quickly, etc.
But with academia, you have a much longer time horizon for tolerating failure. You can meander, search, stumble, and fail often before you eventually succeed. Startups are on a much shorter leash.
Both give you "freedom" in some sense. In a startup you're "free" to choose your problem but the market will decide whether you chose well or not.
If you want to succeed, you are likely to change your direction once you understand your customer. So are you still "free"? It depends on how you look at it.
In academia, you are also not completely "free". If you write a paper that your community doesn't understand or appreciate, you'll have trouble publishing or getting funded. Your "customer" is your research field and you may pivot to problems that your customer cares about.
I like both. Academia is great because you have some flexibility to do both -- as long as you have the energy and ability to manage your time well.
Hi Michael, I wish to follow a goal-oriented research style and I usually think about new problems before I start a project. But I’m afraid they may fall into separate projects without a core. How do you design a long roadmap in research? (SMPL seems to be a perfect example.)
Sometimes you don't see the "core" until later when you look back. A student often feels this way until they start writing their thesis and then they come to realize how the different projects fit together into a bigger story. Learning to tell the bigger story is critical for fundraising.
I bet there is a core. There is a reason why you are doing all these projects. What attracts you to them? You are goal-oriented but which goals do you choose and why?
SMPL emerged from many years of work. We built many 3D body models before we got to SMPL. Each one had problems that we only understood once we built it and used it. We'd fix these problems one by one, learning as we went.
Sure, I had a "vision" for using 3D humans in computer vision but SMPL was an evolution.
We were using a model we called BlendSCAPE and I went around to mocap labs in LA pitching our MoSh technique that converted mocap data directly into 3D human animations.
Everyone said "that looks cool but that body model isn't compatible with any of our tools so we can't use it."
So I went home and thought about this. Could we start with a formulation that would be compatible and make it as good as BlendSCAPE? This set up the design parameters for SMPL. It had to be compatible with existing graphics engines and game technology but it had to be highly realistic. We figured out how to use training data to learn all the pieces of the "standard" model and were surprised when it was even more accurate than BlendSCAPE.
As for a roadmap, while the rough direction is clear at the moment, it only looks like a map when you look backwards in time.
The excitement is inspiring! Did you make conscious efforts to embrace new waves, like deep learning, large-scale models, etc.? Many senior scientists seem to eventually retire/"give up" when there's one paradigm shift too many to adapt to.
I understand that, as a field changes, some people find that the tools they know and love are no longer valued. I'm not too tied to tools. I'm more tied to problems. So when new tools come along, I ask whether they can help me solve my problem.
With deep learning, we started fairly early trying to apply CNNs to 3D human pose and shape estimation. At first it didn't work well and I thought maybe they were not good at 3D regression tasks (mostly they were being used for classification at the time). That thinking slowed me down a bit and I wasn't as quick to embrace it as I could have been.
When BERT appeared, I didn't think that language was going to be super useful for vision. It seemed a bit far-fetched. But then CLIP and ChatGPT changed my thinking quickly. I was always asking my students to think about ways in which language was going to change everything. I was quicker this time around.
If you're a curious person, then change isn't scary. Ok, maybe it's always a little scary. But it's also exciting.
Hi Michael, do you have any plans to write an autobiography or a book about life lessons in the future?
People have approached me about writing books but I figured that it would take me away from doing research. I'd always choose research. Blog posts are a lower time commitment.
I actually have a draft of a children's book but I've never finalized it or sent it out for review.
I was asked what I want to do after "retirement".
The original question seems to have disappeared but here is the answer: Research ;)
Really. I love research and wake up every morning excited by new ideas. I hope that never stops. The question is how to support that research post "retirement"?
As many people know, there is a fixed retirement age at Max Planck. This is still a few years away for me. If you're young, maybe you think retirement is the end of work. But I see it as the end of one job and the beginning of another.
I've had a charmed scientific career. In my 30s I started working on human motion estimation at a time when very few people were looking at the problem and nothing worked. I've stuck with this problem for 30 years and now things work. Really well.
It's rare to choose a problem that is just hard enough, but not too hard, that you can kind of "solve" it in one academic career. But I'm not done.
First, I want to see this work at scale and have a wide impact across society. That's why I've spun off two companies (Body Labs and @meshcapade). Meshcapade is taking this technology the last step and getting it into products across many different industries. I'm super excited to be part of that "chapter" in my career.
Second, I wanted to solve the human motion estimation problem as a foundation for solving other problems. It was not an end in itself. It's a foundational building block. Now that it works, the applications are endless. I want to push applications in human health and have many collaborations there.
But what really excites me is building embodied 3D humans that are intelligent, can see us, and behave like us. I have big dreams about what this unlocks, particularly for entertainment, and I plan to pursue them long after my "retirement" from Max Planck.
The Perceiving Systems Department is a leading Computer Vision group in Germany.
We are part of the Max Planck Institute for Intelligent Systems in Tübingen — the heart of Cyber Valley.
We use Machine Learning to train computers to recover human behavior in fine detail, including face and hand movement. We also recover the 3D structure of the world, its motion, and the objects in it to understand how humans interact with 3D scenes.
By capturing human motion and modeling behavior, we contribute realistic avatars to Computer Graphics.
To have an impact beyond academia we develop applications in medicine and psychology, spin off companies, and license technology. We make most of our code and data available to the research community.