Embodiment in virtual humans and robots
SMPL as "robot"
02 June 2024 · 5 minute read
Embodied AI is a hot topic, but embodiment is not just for robots; it's also critical for virtual humans. What may be surprising is that there are important synergies between training robots to move and training avatars to behave like real people. Embodiment is about connecting the AI "brain" of an agent to a "body" that can move through the world (real or virtual) to interact with it and with other agents. Embodiment grounds AI in the dynamically changing 3D world.
For virtual humans, the most common body is SMPL. SMPL is a parametric 3D model of the human body that encapsulates our shape and movement. It represents our body shape, pose, facial expressions, hand gestures, soft-tissue deformations, and more, all in a compact form of about 100 numbers.
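To make that concrete, here is a minimal sketch of posing SMPL with the smplx Python package; the model path below is a placeholder, and the parameter counts are for the basic SMPL variant (SMPL-X adds hands and face):

```python
import torch
import smplx  # pip install smplx; model files from https://smpl.is.tue.mpg.de

# Load the SMPL body model; "models" is a placeholder path to the
# downloaded model files.
model = smplx.create(model_path="models", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)         # body shape coefficients (PCA space)
body_pose = torch.zeros(1, 69)     # 23 body joints x 3 axis-angle values
global_orient = torch.zeros(1, 3)  # root orientation

# Roughly 100 numbers in, a posed 3D body out.
output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
print(output.vertices.shape)  # (1, 6890, 3): the SMPL mesh vertices
```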
This post explores why I describe SMPL as a "robot".
Virtual humans as robots
Our goal is to create a virtual human that behaves just like a real one. In a virtual world, this is an "embodied AI" that has the ability to perceive its environment, understand the world and the actions of other agents, plan its actions, and execute behaviors that change the environment. In this recent talk at Stanford, I described the idea of a virtual human as a robot in terms of a "3D human foundation agent".
That sounds a lot like a robot. Swap out the SMPL body for a humanoid robot body and swap the virtual world for the real world, and the problems look quite similar.
There are some key differences, however. Our virtual humans need to move like real humans -- their motor control system needs to mimic human motion to be believable. This may be desirable in a physical robot, but it isn't necessary or even always possible.
The second big difference is physics. In the real world of robots, you simply can't ignore physics, whereas in the virtual world you have some flexibility in how much real-world physics you want to model. This makes training the "SMPL robot" easier than training a real robot. Plus, SMPL never breaks down!
SMPL as a universal humanoid
Since SMPL represents all we need to know about human movement, it can serve as something like a "universal language" of behavior. At Meshcapade we often describe it as a "secret decoder ring". You can take many forms of data -- images, video, IMUs, 3D scans, or even text -- and translate them into SMPL format. That is, you can encode data from the world into SMPL format.
You can then decode this data back out to the same formats or translate it to new humanoid characters through retargeting. In graphics, we often retarget SMPL to new game characters (e.g. using the Meshcapade UEFN plugin for Unreal). But you can also retarget human movements to physical robots.
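As a toy illustration of the decoding side, here is a hedged sketch of kinematic retargeting; the joint-correspondence table and limits are hypothetical, and real systems also handle bone-length differences, joint conventions, and balance:

```python
import numpy as np

# A toy sketch of retargeting SMPL joint rotations to a humanoid robot.
# The correspondence table and joint limits are hypothetical; real systems
# also handle bone-length differences, joint conventions, and balance.
SMPL_JOINTS = {"left_hip": 1, "right_hip": 2, "left_knee": 4, "right_knee": 5}
ROBOT_LIMITS = {"left_knee": (0.0, 2.3), "right_knee": (0.0, 2.3)}  # radians

def retarget(smpl_pose: np.ndarray) -> dict:
    """Map SMPL pose (24 joints x 3 axis-angle values) to robot joint targets."""
    targets = {}
    for name, idx in SMPL_JOINTS.items():
        angle = float(np.linalg.norm(smpl_pose[idx]))     # rotation magnitude
        lo, hi = ROBOT_LIMITS.get(name, (-np.pi, np.pi))  # clamp to robot limits
        targets[name] = float(np.clip(angle, lo, hi))
    return targets

print(retarget(np.zeros((24, 3))))  # all-zero pose -> all-zero joint targets
```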
AMASS: The warehouse of human behavior
The first paper from Meshcapade was AMASS -- the world's largest collection of 3D human movement data in a unified format, i.e. SMPL format (actually SMPL-X, but that's a detail). We built it because modern AI is data hungry: to learn about human behavior, we need data at scale. It's fair to say that almost all deep learning methods today that model human motion rely on AMASS for training data.
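To give a sense of the unified format, here is a minimal sketch of reading one AMASS sequence with numpy; the key names follow the SMPL-H npz release and differ slightly in newer versions, so treat them as illustrative:

```python
import numpy as np

# A minimal sketch of reading one AMASS sequence. Key names follow the
# SMPL-H npz release and vary slightly across AMASS versions.
data = np.load("sequence.npz")        # one downloaded AMASS motion file
poses = data["poses"]                 # (T, 156): per-frame axis-angle pose
betas = data["betas"]                 # (16,): subject body shape
trans = data["trans"]                 # (T, 3): root translation in meters
fps = float(data["mocap_framerate"])  # capture frame rate

print(f"{poses.shape[0] / fps:.1f} s of motion at {fps:.0f} fps")
```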
For example, people mine this data to train diffusion models to generate human movement. When you add text labels (see BABEL), you can then condition your generative model of motion on text. Add in speech and gesture (see EMAGE) and you can train full body avatars that are driven purely from speech.
AMASS keeps growing and will soon have over 100 hours of high-quality human motion data. But that's just the beginning. Our goal is to make it the repository of everything humans can do and have ever done.
Learning from humans
In Perceiving Systems and at Meshcapade, we focus on using data like AMASS to train virtual humans. But you can also use it to train robots. For example, OmniH2O uses AMASS and learns to retarget SMPL to a humanoid robot. There are also many examples of using AMASS together with reinforcement learning to train policies that mimic human behavior.
Here's the good bit -- there are now robust methods for estimating SMPL from video in 3D world coordinates (e.g. WHAM). For robotic applications, the 3D model of the human is critical, as is the estimation of the human in 3D space.
This means that video can serve as an input for robot learning and control. One can learn from demonstrations captured on video by encoding them into SMPL format. Thus, the full system has SMPL in the middle, with an encoder on the input side and a decoder that retargets on the output side.
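Putting the pieces together, here is a runnable skeleton of that loop; the estimator and retargeting function are stubs standing in for real components (e.g. WHAM as the video-to-SMPL encoder and a learned retargeting policy as the decoder):

```python
from dataclasses import dataclass
import numpy as np

# A runnable skeleton of the video -> SMPL -> robot loop. The estimator and
# retargeting function below are stubs, not real implementations.

@dataclass
class SMPLFrame:
    pose: np.ndarray   # (24, 3) axis-angle joint rotations
    trans: np.ndarray  # (3,) root translation in world coordinates

def estimate_smpl_from_video(video_path: str) -> list:
    # Stub: a real system would run a video-based SMPL estimator here.
    return [SMPLFrame(np.zeros((24, 3)), np.zeros(3)) for _ in range(30)]

def retarget(pose: np.ndarray) -> dict:
    # Stub: map SMPL joint rotations to robot joint targets.
    return {"left_knee": float(np.linalg.norm(pose[4]))}

for frame in estimate_smpl_from_video("demo.mp4"):
    targets = retarget(frame.pose)  # SMPL sits in the middle as the interchange format
    # robot.command(targets)        # send to a controller, on hardware or in simulation
```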
SMPL as the "latent space"
Encoder-decoder architectures are common in machine learning. Typically we encode into a latent space. It's called "latent" because we may only know its statistical structure but not what it "means". When you encode into SMPL, it isn't actually latent because the parameters of SMPL are interpretable.
The important thing about a latent space is not that it's "latent", but that it's compact. SMPL is, in a concrete sense, designed to be a minimal representation of humans. The representation factors body shape from pose. The correlations between shape and pose are modeled by "pose corrective" blend shapes. The body shape model is learned using principal component analysis (PCA), which is optimal for compressing data that is linear and Gaussian -- and body shape variations are close to Gaussian. Think back to when you were first taught about normal distributions: the example was probably human height and weight. The space of body poses is mechanically determined by the joints of the body, and the recent SKEL version of SMPL (CVPR 2024) precisely models these degrees of freedom in a minimal, biomechanically accurate way.
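To make the PCA point concrete, here is a minimal numpy sketch of SMPL's linear shape space, with zero-filled placeholders standing in for the learned mean template and shape basis:

```python
import numpy as np

# A worked sketch of SMPL's linear (PCA) shape space. The template and
# basis are zero-filled placeholders; the real model learns them from
# thousands of registered 3D body scans.
n_verts, n_betas = 6890, 10
v_template = np.zeros((n_verts, 3))          # mean body shape
shapedirs = np.zeros((n_verts, 3, n_betas))  # PCA shape basis

betas = np.zeros(n_betas)
betas[0] = 2.0  # two standard deviations along the first principal component

# A new body shape is the mean plus a linear combination of shape directions.
v_shaped = v_template + shapedirs @ betas
print(v_shaped.shape)  # (6890, 3)
```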
Summary
Embodiment is not just for physical robots. Virtual humans are also embodied, and thinking of them this way can actually help robotics.
We think of SMPL as a virtual robot. We collect data of human behavior at scale. We use this to learn how humans behave and move. And then we can retarget this behavior to other virtual characters or physical robots. We can also think about SMPL as a "universal language" for articulated human movement -- you can translate data into SMPL format and back out to another embodiment.