Physical Intelligence for Humanist Robots
30 June 2025 Michael J. Black 8 minute read
The lesson of modern AI is that data wins and scale matters. AI systems for language, audio, images, and video have all been enabled by the near limitless amount of data on the Internet.
For robotics to follow the success of these models, it needs a similarly limitless data source. While some will argue that the path is through simulation, I think this is only part of the solution. Physics simulation allows massive scaling for learning a specific task, but it does not scale well across many tasks, since each task must be defined in the simulation environment.
I argue that real scale will come from the same place it has come from for every other AI model – the Internet. For robots with a human-like form factor, there is an effectively infinite amount of video that shows humans interacting with the world. Video also comes with accessible semantic information about what people are doing and why.
The key question is how to unlock video data to train robots with the physical intelligence of humans. Unlike image and video diffusion models, which take in 2D data and generate 2D output, robots live in 3D. To effectively mimic human behavior, we have to convert the 2D world of human behavior into 3D. For this, we need a “Human Understanding Machine.”
At Meshcapade, we’ve built the technology to “capture” 3D human motion accurately from any video. The result is a compact representation of human movement in terms of time-varying joint angles, a format widely used for training humanoids.
Transferring 3D human motion from video to the kinematic structure of a humanoid. A 3D avatar of a G1 robot is animated in Unreal using the captured 3D human motion.
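To make the idea concrete, here is a minimal, hedged sketch of what retargeting captured human motion to a humanoid can look like: per-frame joint rotations from the capture are projected onto a few robot hinge joints and clamped to joint limits. The joint names, indices, axes, and limits are assumptions for illustration, not Meshcapade's or Unitree's actual mapping.

```python
# Minimal sketch (not the actual pipeline): retarget one frame of captured
# human motion, given as per-joint axis-angle rotations, onto a few hinge
# joints of a humanoid. JOINT_MAP, HINGE_AXES, and JOINT_LIMITS are
# hypothetical; a real system maps the full body and respects dynamics.
import numpy as np

JOINT_MAP = {                       # hypothetical human joint index per robot joint
    "left_elbow": 18, "right_elbow": 19,
    "left_knee": 4,   "right_knee": 5,
}
HINGE_AXES = {                      # assumed rotation axis of each robot hinge
    name: np.array([1.0, 0.0, 0.0]) for name in JOINT_MAP
}
JOINT_LIMITS = {                    # assumed robot joint limits in radians
    "left_elbow": (-2.6, 0.0), "right_elbow": (-2.6, 0.0),
    "left_knee": (0.0, 2.4),   "right_knee": (0.0, 2.4),
}

def retarget_frame(pose_aa: np.ndarray) -> dict:
    """Map one frame of human joint rotations (J, 3, axis-angle) to scalar
    robot joint targets by projecting onto each hinge axis and clamping."""
    targets = {}
    for name, j in JOINT_MAP.items():
        angle = float(pose_aa[j] @ HINGE_AXES[name])   # signed rotation about the hinge
        lo, hi = JOINT_LIMITS[name]
        targets[name] = float(np.clip(angle, lo, hi))
    return targets

# Example: a 2-second clip at 30 fps with 24 body joints.
motion = np.random.uniform(-0.5, 0.5, size=(60, 24, 3))
trajectory = [retarget_frame(frame) for frame in motion]
```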
Humanist robotics
Getting a humanoid robot to do a backflip is a nice trick. It's easy to view this as a mark of progress in the field, and it is. But it's not the real challenge.
When something is hard for us, we're impressed when a machine does it. When something is easy for us, we think it must be easy for machines. The opposite is often the case.
The mundane things that we do every day are the hallmark of physical intelligence. Humanoid robots must be masters of the mundane. The problem is that there are an awful lot of “mundane” tasks and traditional training approaches don’t scale. I argue that the easiest way to achieve this is to learn from us by watching us.
I like to call this humanist robotics. The word “humanist” comes from the Italian umanista, meaning "student of human affairs or human nature." The way to make humanoid robots that are highly capable and human-like is to make them students of our behavior.
There are several core principles of physical intelligence for humanist robots:
- Motor system: Robots must have a base, pre-trained motor system that is flexible and robust. It must support a wide range of behaviors and keep the robot out of situations from which it cannot recover.
- Human mimicry: Robots scale their skills by watching humans and then trying to replicate human behavior through imitation learning. For this, they need to be able to track 3D human motion and map this to their own motor system.
- Contact: It is not only the human movement that matters but also the human contact with the world and the effect of that contact on the world. The challenge is for the robot to infer our contacts with the world from video and then replicate them in new scenarios (see the data sketch after the figure below).
- Human understanding: To work with us and among us, robots have to see us. To hand us a cup of coffee, they need to make precise predictions about our 3D movement. But they also need to know about social situations and emotions to, for example, know when and how to interrupt a conversation to ask us a question.
3D human and object contact estimated by PICO.
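One way to make such contacts usable for training is to store them as explicit per-frame correspondences between body parts and object surfaces, roughly the kind of information a method like PICO estimates. The record below is a hedged sketch; the classes, field names, and vertex indices are hypothetical, not PICO's actual output format.

```python
# Sketch of a per-frame contact record: which body parts touch which objects,
# and where on each surface. Names and indices are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ContactEvent:
    body_part: str              # e.g. "right_palm"
    object_id: str              # e.g. "mug_03"
    body_vertices: list[int]    # mesh vertices on the body in contact
    object_vertices: list[int]  # corresponding vertices on the object

@dataclass
class ContactFrame:
    time_s: float
    contacts: list[ContactEvent] = field(default_factory=list)

# A grasp of a mug starting at t = 1.0 s might be recorded as:
frame = ContactFrame(
    time_s=1.0,
    contacts=[ContactEvent("right_palm", "mug_03",
                           body_vertices=[5431, 5432, 5500],
                           object_vertices=[120, 121, 140])],
)
```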
"Conventional" motion and robotic intelligence
Much of the work on achieving physical intelligence relies on physics simulation. Physics is important for learning locomotion and basic motor control. It is also useful when paired with data-driven methods to increase robustness and generalization. But, for much of what humans do, physics is effectively irrelevant.
Think about how we gesture, shake hands, and touch each other. These are governed by convention, not physics. The Germans shake hands, the Italians kiss two cheeks, the Swiss kiss three. These are conventions. How we play sports is influenced by gravity, but the rules of the game and how we play it are all about convention.
What it means to move and be human is a combination of the physical, emotional, and conventional. When humanoid motion deviates from human convention, it becomes scary and easily misunderstood. The more human-looking the robot the scarier this becomes (the Uncanny Valley).
While physical simulation is great for training a humanoid to stand and walk, physics won’t help much for teaching a robot how to set a table or how to subtly get someone’s attention. Imagine a robot coming to give a message to a group of people who are deep in conversation. A human would wait until there is a good point in the conversation or would notice when the people turn to look at them. One looks for a social entry point. This is hard to capture with physics or teleoperation.
To train robots to move conventionally, we need physically grounded motion that is contextually appropriate and humanist. The solution is mimicry. Robots need to watch us and learn from us. Fortunately, nearly everything humans have ever done has been captured on video. But video is just a sequence of 2D images, and robots work in a 3D world. To “mine” all the video in the world, we need to convert it into a 3D form, together with the semantics of the action and scene, so that we can train robots to behave like humans in similar scenarios.
MoCapade3.0: 3D markerless motion capture from video
Skill bank: Learn once and deploy often
Relying on simulation to train a diverse set of skills on diverse robot hardware is time-consuming. For robots that have a human-like embodiment, we can take a different approach and learn a model of how humans move and solve tasks. This can be learned once from massive amounts of video data, creating a huge "skill bank".
The goal is to mine video at scale to capture 3D motions and contacts, train a generative model of human behavior, and then transfer these skills to different humanoid platforms.
This "learn and transfer" approach has an additional benefit. Every deployed robot can learn new tasks through live human demonstration. These human demonstrations feed back into the skill bank, improving the model of human behavior, which is shared among all robots.
Effectively, the model of human behavior becomes the huge central repository of all human skills. This enables dramatic data scaling.
To implement this, the skill bank reduces human motion to a latent representation. The base motor system has its own latent representation. Transfer then involves learning mappings between these latent representations. This is related to the concept of Latent Codes as Bridges (LCB), which introduces learnable latent codes that serve as a bridge between LLMs and a base motor system.
Generating reactive humanoid motions by training on a large skill bank. Video shows transfer of generated motions onto CG characters in Unreal.
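As a rough illustration of the latent-bridging idea, the sketch below trains a small network to map latents from a (stand-in) human-motion model into the latent space of a (stand-in) base motor system. Dimensions, module names, and the training signal are assumptions; this is not the LCB implementation.

```python
# Minimal sketch of "latent codes as bridges": a small network maps latents
# from a pretrained human-motion model into the latent space of a robot's
# base motor system. Sizes and supervision are placeholders.
import torch
import torch.nn as nn

HUMAN_LATENT_DIM = 256   # assumed size of the skill-bank motion latent
ROBOT_LATENT_DIM = 128   # assumed size of the base motor system latent

class LatentBridge(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HUMAN_LATENT_DIM, 256), nn.GELU(),
            nn.Linear(256, ROBOT_LATENT_DIM),
        )

    def forward(self, z_human: torch.Tensor) -> torch.Tensor:
        return self.net(z_human)

bridge = LatentBridge()
optimizer = torch.optim.Adam(bridge.parameters(), lr=1e-4)

# The human encoder and robot decoder would be frozen, pretrained models;
# only the bridge is optimized so decoded robot motion imitates the human.
z_human = torch.randn(32, HUMAN_LATENT_DIM)          # batch of skill latents
z_robot_target = torch.randn(32, ROBOT_LATENT_DIM)   # placeholder supervision
loss = nn.functional.mse_loss(bridge(z_human), z_robot_target)
loss.backward()
optimizer.step()
```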
The Human Understanding Machine
So how do we learn physical intelligence that enables the humanist robot?
1. Synthetic training data for 3D human understanding
Video data with “ground truth” 3D human behavior is key to training robots to see us. Today, it’s possible to create videos of realistic humans with natural behaviors, complex clothing, hair, scenes, lighting, etc., at scale and with perfect ground truth. The BEDLAM dataset is an example of this. It’s purely synthetic and is used to train methods to estimate 3D human motion.
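For concreteness, a synthetic training sample in a BEDLAM-style dataset can be thought of as a rendered frame paired with exact ground-truth body and camera parameters. The record below is only a sketch; the field names and shapes are illustrative, not the actual dataset format.

```python
# Sketch of a synthetic training sample: a rendered image plus exact labels.
# Because the data is generated, every label is perfect: no annotation noise.
from dataclasses import dataclass
import numpy as np

@dataclass
class SyntheticSample:
    image_path: str            # rendered RGB frame
    body_pose: np.ndarray      # (J, 3) ground-truth joint rotations (axis-angle)
    body_shape: np.ndarray     # (10,) identity/shape coefficients
    camera_K: np.ndarray       # (3, 3) camera intrinsics
    camera_T: np.ndarray       # (4, 4) camera-to-world extrinsics

sample = SyntheticSample(
    image_path="renders/seq_0001/frame_0042.png",
    body_pose=np.zeros((24, 3)),
    body_shape=np.zeros(10),
    camera_K=np.eye(3),
    camera_T=np.eye(4),
)
```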
2. Markerless motion capture: Building the skill bank
The ultimate goal is to turn every video of human behavior into training data for humanist robots. Imitation learning requires human motion data together with the semantics of what people are doing and the context of the scene, and it all needs to be accurate and in 3D. MoCapade3.0 takes video and turns it into training data (see PromptHMR for the method). This includes the 3D pose of all the people in every frame in “3D world coordinates.” Paired semantic information can be extracted from the video using a video foundation model.
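A hedged sketch of what such a training record might look like: per-frame 3D poses and root trajectories in world coordinates, paired with a semantic description of the action. The capture and captioning functions below are placeholders that return dummy data, not real APIs.

```python
# Sketch of turning a video clip into an imitation-learning record:
# per-frame 3D poses in world coordinates plus a semantic description.
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingClip:
    poses_world: np.ndarray   # (T, P, J, 3) joints for P people over T frames
    trajectories: np.ndarray  # (T, P, 3) root positions in world coordinates
    description: str          # e.g. "a person sets a table with two plates"

def capture_3d_motion(video_path: str):
    """Placeholder for a markerless motion-capture call (video -> 3D poses).
    Returns dummy arrays so the sketch runs."""
    T, P, J = 90, 1, 24
    return np.zeros((T, P, J, 3)), np.zeros((T, P, 3))

def describe_video(video_path: str) -> str:
    """Placeholder for a video foundation model that captions the action."""
    return "a person sets a table with two plates"

def build_clip(video_path: str) -> TrainingClip:
    poses, roots = capture_3d_motion(video_path)
    return TrainingClip(poses, roots, describe_video(video_path))

clip = build_clip("videos/table_setting.mp4")
```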
3. Learning to solve tasks
Recent work on generative human motion, like PRIMAL, builds on a foundational motor system to learn new tasks efficiently. PRIMAL takes small amounts of video data, extracts 3D human motion, and fine-tunes a control network that enables the motor system to generate the demonstrated behaviors.
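The general pattern described here can be sketched as follows: keep a pretrained motion prior frozen and optimize only a small control network on a handful of demonstrations extracted from video. This is an illustration of that pattern, not PRIMAL's actual architecture.

```python
# Sketch: freeze a pretrained motor/motion prior and fine-tune only a small
# control head on a few demonstrations extracted from video.
import torch
import torch.nn as nn

motion_prior = nn.GRU(input_size=64, hidden_size=256, batch_first=True)  # stand-in for a pretrained base model
for p in motion_prior.parameters():
    p.requires_grad = False          # the base motor system stays frozen

control_head = nn.Linear(256, 64)    # small task-specific control network
optimizer = torch.optim.Adam(control_head.parameters(), lr=3e-4)

# A "demonstration": motion features extracted from a short video clip.
demo_inputs = torch.randn(8, 120, 64)    # (batch, frames, features) placeholder
demo_targets = torch.randn(8, 120, 64)   # placeholder imitation targets

features, _ = motion_prior(demo_inputs)
loss = nn.functional.mse_loss(control_head(features), demo_targets)
loss.backward()
optimizer.step()
```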
4. Human-robot interaction
Real-time markerless motion capture will enable robots to see and understand the humans around them. This will support gesture, intent, and emotion recognition as well as 3D pose estimation for tasks like hand-offs. Human trajectories in 3D space are necessary to predict and help avoid collisions.
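As a toy illustration of the prediction side, the sketch below extrapolates a tracked person's root trajectory one second into the future and checks whether it comes near the robot. A constant-velocity model stands in for a learned predictor, and the thresholds are arbitrary assumptions.

```python
# Sketch: short-horizon prediction of a tracked person's root trajectory so a
# robot can flag potential collisions. Constant velocity is a stand-in for a
# learned predictor.
import numpy as np

def predict_positions(track: np.ndarray, horizon_s: float = 1.0, fps: float = 30.0) -> np.ndarray:
    """track: (T, 3) recent root positions in world coordinates.
    Returns predicted positions over the next `horizon_s` seconds."""
    velocity = (track[-1] - track[-2]) * fps            # m/s from the last two frames
    steps = int(horizon_s * fps)
    dt = np.arange(1, steps + 1)[:, None] / fps
    return track[-1] + dt * velocity

def collision_risk(human_track: np.ndarray, robot_pos: np.ndarray, radius: float = 0.5) -> bool:
    """True if the predicted human path comes within `radius` meters of the robot."""
    future = predict_positions(human_track)
    return bool(np.any(np.linalg.norm(future - robot_pos, axis=1) < radius))

track = np.cumsum(np.full((30, 3), [0.03, 0.0, 0.0]), axis=0)  # person walking toward +x
print(collision_risk(track, robot_pos=np.array([2.0, 0.0, 0.0])))
```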
5. Teleoperation
Real-time markerless 3D human motion capture also supports teleoperation at scale. Given any camera, one can track a human in 3D and convert this into a control signal to guide the robot. Because video-based motion capture is not tied to a laboratory, everyone with a humanoid robot can instruct it using video-based teleoperation, creating data at scale.
Real-time 3D human motion capture from video can support training, teleoperation, and interaction. Animation shows the captured motion used to animate a G1 avatar in Unreal.
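A hedged sketch of what such a teleoperation loop could look like: capture the operator's 3D pose each frame, retarget it (for example, with a mapping like the one sketched earlier), and stream joint targets to the robot. The tracker and robot interfaces below are placeholders, not real APIs.

```python
# Sketch of a video-based teleoperation loop: capture the operator's pose,
# retarget it, and stream joint targets to the robot at a fixed rate.
import time
import numpy as np

class FakePoseTracker:
    """Stand-in for a real-time markerless motion-capture stream."""
    def next_pose(self) -> np.ndarray:
        return np.zeros((24, 3))      # (J, 3) axis-angle joint rotations

class FakeRobot:
    """Stand-in for a humanoid control interface."""
    def send_joint_targets(self, targets: dict) -> None:
        pass

def teleoperate(tracker, robot, retarget, hz: float = 30.0, duration_s: float = 1.0):
    period = 1.0 / hz
    for _ in range(int(duration_s * hz)):
        pose = tracker.next_pose()                 # operator's current 3D pose
        robot.send_joint_targets(retarget(pose))   # e.g. the retargeting sketched earlier
        time.sleep(period)                         # a real system would use a proper scheduler

teleoperate(FakePoseTracker(), FakeRobot(), retarget=lambda pose: {"left_elbow": 0.0})
```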
Beyond physical intelligence to behavioral intelligence
Human understanding involves more than the capture of our 3D motion and contacts. It requires high-level reasoning about what people do and why, and the ability to connect that to motor behavior. To act on conventions, understand the mundane, and work with humans, robots need more than physical intelligence. Like humans, they need situational and emotional intelligence.
To enable this, we've developed several foundation models that can both understand and generate human behavior. For example, ChatHuman is a retrieval-augmented generation system that exploits 26 different computational tools for the analysis and generation of behavior and combines them with the broad real-world understanding of a large vision-language model.
ChatHuman is an example of a system that ties together 3D human capture, understanding, and generation: the three pillars of behavioral intelligence. The pieces of the puzzle are coming together and suggest that understanding and modeling our own behavior is a viable path to robotic systems that can work among us.
ChatHuman combines specialized tools with its world knowledge to reason about and generate 3D human behavior.
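The general tool-use pattern behind such a system can be sketched as a language model that selects a specialized human-analysis tool, runs it, and folds the result into its answer. The tools and the keyword dispatch below are invented for illustration and do not reflect ChatHuman's actual implementation.

```python
# Sketch of the tool-use pattern: pick a specialized human-analysis tool,
# run it, and fold the evidence into the answer. Tools and dispatch are
# invented placeholders.
def estimate_3d_pose(image):   return {"tool": "pose", "result": "3D body pose"}
def detect_contact(image):     return {"tool": "contact", "result": "hand-object contact"}
def recognize_emotion(image):  return {"tool": "emotion", "result": "facial expression"}

TOOLS = {
    "pose": estimate_3d_pose,
    "contact": detect_contact,
    "emotion": recognize_emotion,
}

def answer(question: str, image) -> str:
    # A real system would let the language model select and parameterize
    # tools; here a simple keyword match stands in for that decision.
    for keyword, tool in TOOLS.items():
        if keyword in question.lower():
            evidence = tool(image)
            return f"Using the {evidence['tool']} tool: {evidence['result']}"
    return "Answered from the vision-language model's world knowledge alone."

print(answer("Is the person's pose stable?", image=None))
```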
The Perceiving Systems Department is a leading Computer Vision group in Germany.
We are part of the Max Planck Institute for Intelligent Systems in Tübingen — the heart of Cyber Valley.
We use Machine Learning to train computers to recover human behavior in fine detail, including face and hand movement. We also recover the 3D structure of the world, its motion, and the objects in it to understand how humans interact with 3D scenes.
By capturing human motion and modeling behavior, we contribute realistic avatars to Computer Graphics.
To have an impact beyond academia we develop applications in medicine and psychology, spin off companies, and license technology. We make most of our code and data available to the research community.