
Robots learn by watching human videos

by Marco van der Hoeven

Robotics researchers are increasingly looking beyond teleoperation as a way to train more general-purpose robots. Skild AI has outlined an approach in which robots learn new skills by observing videos of humans performing tasks, rather than relying primarily on manually guided robot demonstrations.

According to the company, the robotics sector faces a growing data bottleneck. While large language models were trained on massive, internet-scale text datasets, robot learning has largely depended on teleoperation, where human operators directly control robots to generate training data. Skild AI argues that this method does not scale well enough to support foundation models for robotics.

The company identifies two main limitations of teleoperation. First, data diversity is constrained because robots are typically trained in controlled environments such as laboratories or specific deployment sites. This limits exposure to the wide range of real-world situations robots are expected to handle. Second, teleoperation is inherently time-bound: each demonstration takes place in real time, making it difficult to collect data at the scale seen in language or vision models.

As an alternative, Skild AI points to observational learning, a mechanism common in humans. People often learn tasks by watching others perform them, without explicit instructions about forces, trajectories, or muscle activations. The company argues that a comparable strategy can be applied to robots by leveraging the vast amount of existing video data, including first-person recordings and instructional videos available online.

Using video as training data introduces technical challenges. Videos do not directly capture physical signals such as force, torque, or tactile feedback, which are important for robotic control. In addition, there is what researchers refer to as an “embodiment gap”: the physical differences between human bodies and robotic platforms make it non-trivial to translate observed actions into executable robot movements.
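The embodiment gap can be made concrete with a toy retargeting step. The sketch below, in which every function name and scale factor is illustrative rather than anything Skild AI has published, maps a human wrist position observed in video onto a target for a robot arm with shorter reach:

```python
# Hypothetical illustration of the "embodiment gap": an action observed in
# video (a human wrist trajectory) cannot be executed directly by a robot
# with a different body, so it must first be retargeted into the robot's
# own workspace. The scale and reach values are made up for illustration.

def retarget_human_to_robot(human_wrist_xyz, arm_scale=0.8, reach_limit=0.7):
    """Map a human wrist position (metres, shoulder-centred) to a robot
    end-effector target, clipping to the robot's shorter reach."""
    target = [c * arm_scale for c in human_wrist_xyz]
    norm = sum(c * c for c in target) ** 0.5
    if norm > reach_limit:  # the robot cannot reach as far as a human
        target = [c * reach_limit / norm for c in target]
    return target

# A point at the edge of human reach gets scaled and clipped for the robot.
print(retarget_human_to_robot([1.0, 0.0, 0.0]))
```

Even this simple geometric mismatch already changes the action; real systems must also bridge differences in joint counts, grippers versus hands, and dynamics, which is what makes translating video into executable robot motion non-trivial.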

Skild AI says its approach, which it describes as “omni-bodied learning,” is designed to address this gap. The company claims its model can be pretrained to acquire new skills primarily from video demonstrations, supplemented by a limited amount of robot-specific data—less than an hour in some cases. By reducing reliance on large-scale teleoperation, Skild AI believes this method could make training robotics foundation models more scalable.
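The two-stage recipe described above, large-scale pretraining on human video followed by a brief robot-specific adaptation, can be sketched with a toy model. Everything here (the 1-D linear "policy", the data, the learning rate) is an invented stand-in, not Skild AI's actual pipeline; it only shows why a small fine-tuning set can suffice once the bulk of the mapping is learned elsewhere:

```python
# Toy sketch of pretrain-then-finetune (not Skild AI's actual method):
# abundant "human video" data teaches a general input-to-action mapping,
# then a handful of robot-specific examples correct for the embodiment
# difference.

def sgd_step(w, x, y, lr=0.1):
    """One least-squares gradient step for a 1-D linear policy y = w * x."""
    return w - lr * 2 * x * (w * x - y)

def fit(w, data, epochs=50):
    for _ in range(epochs):
        for x, y in data:
            w = sgd_step(w, x, y)
    return w

# Stage 1: many video-derived examples teach the general mapping (w ~ 2.0).
video_data = [(x / 100, 2 * x / 100) for x in range(1, 101)]
w = fit(0.0, video_data)

# Stage 2: five robot-specific examples shift the policy to the robot's
# body (here the target mapping is w ~ 1.6, e.g. a shorter reach).
robot_data = [(x / 10, 1.6 * x / 10) for x in range(1, 6)]
w = fit(w, robot_data)
print(round(w, 2))
```

The point of the toy example is the data budget: the fine-tuning set is twenty times smaller than the pretraining set, yet it is enough to adapt the already-trained mapping, which mirrors the company's claim that less than an hour of robot-specific data can suffice.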

The announcement reflects a broader trend in robotics research, where learning from visual observation is increasingly explored as a way to accelerate skill acquisition and reduce data collection costs. Whether such approaches can generalize reliably across different robot types and real-world conditions remains an active area of research.
