Despite a lot of imaginative fiction, most robots still don’t understand the world around them at a deep level. They can be programmed to back up after bumping into a chair, but they can’t recognize what a chair is or know that bumping into a spilled soda will only make a bigger mess. To help researchers build more intelligent real-world robots, Facebook has developed and open-sourced the droidlet platform.
Droidlet is designed to be a modular, heterogeneous embodied-agent architecture and a platform for building embodied agents, sitting at the intersection of natural language processing, computer vision, and robotics. It is meant to simplify the integration of a wide range of machine learning algorithms into embodied systems and robots, facilitating rapid prototyping.
People using droidlet can quickly test out different computer vision algorithms with their robot, for example, or replace one natural language understanding model with another. Droidlet enables researchers to easily build agents that can accomplish complex tasks either in the real world or in simulated environments like Minecraft or Habitat.
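The swap-one-component-for-another idea can be sketched as code. The classes and method names below are hypothetical illustrations of the modular pattern, not droidlet’s actual API:

```python
from abc import ABC, abstractmethod

# Hypothetical interfaces: any detector or parser that satisfies the
# interface can be dropped into the agent without other changes.
class PerceptionModule(ABC):
    @abstractmethod
    def detect_objects(self, image):
        """Return a list of detected objects."""

class StubDetector(PerceptionModule):
    """Heuristic stand-in: always reports an empty scene."""
    def detect_objects(self, image):
        return []

class KeywordParser:
    """Toy natural language understanding component."""
    def parse(self, command):
        # Map a spoken command to an intent by keyword matching.
        if "pick up" in command:
            return {"action": "grab",
                    "target": command.split("pick up", 1)[1].strip()}
        return {"action": "noop"}

class Agent:
    """The agent composes independently replaceable components."""
    def __init__(self, perception, parser):
        self.perception = perception
        self.parser = parser

    def step(self, image, command):
        objects = self.perception.detect_objects(image)
        intent = self.parser.parse(command)
        return {"objects": objects, "intent": intent}

# Replacing StubDetector with a different PerceptionModule, or
# KeywordParser with a learned language model, requires no change
# to the Agent class itself.
agent = Agent(StubDetector(), KeywordParser())
result = agent.step(image=None, command="pick up the blue tube")
```

Here `result["intent"]` comes out as `{"action": "grab", "target": "the blue tube"}`; the point is that either component can be replaced behind its interface while the agent loop stays fixed.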
A family of agents
According to Facebook’s announcement, there is still much work to do, both in AI and in hardware engineering, before robots come close to what is imagined in books, movies, and TV shows. With droidlet, robotics researchers can take advantage of significant recent progress across AI and build machines that respond effectively to complex spoken commands like “pick up the blue tube next to the fuzzy chair that Bob is sitting in.”
Rather than treating an agent as a monolith, Facebook designed the droidlet agent as a collection of components, some heuristic and some learned. As more researchers build with droidlet, they will improve its existing components and add new ones, which others can in turn adopt in their own robotics projects.
This heterogeneous design makes scaling tractable: each component can be trained on large datasets when such data is available for it, while programmers can fall back on sophisticated heuristics where those exist. Components can be trained with static data when convenient (e.g., a collection of labeled images for a vision component) or with dynamic data when appropriate (e.g., a grasping subroutine learned through interaction).
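The heuristic-or-learned distinction can also be made concrete: the same slot in the agent can hold either kind of component, as long as both expose the same interface. The names below are illustrative assumptions, not droidlet’s real classes:

```python
# Hypothetical sketch: a grasp-selection slot that accepts either a
# hand-written heuristic or a trained policy behind one interface.

class HeuristicGrasp:
    """Hand-written rule: grasp the nearest object."""
    def choose(self, objects):
        return min(objects, key=lambda o: o["distance"]) if objects else None

class LearnedGrasp:
    """Placeholder for a policy trained on interaction data."""
    def __init__(self, policy):
        self.policy = policy  # e.g. a trained neural network

    def choose(self, objects):
        return self.policy(objects)

objects = [{"name": "cup", "distance": 0.4},
           {"name": "tube", "distance": 0.9}]

# The heuristic picks the nearest object.
grasp = HeuristicGrasp()
nearest = grasp.choose(objects)

# Swapping in a "learned" policy (here a trivial stand-in) changes
# behavior without touching the rest of the agent.
grasp = LearnedGrasp(policy=lambda objs: objs[-1])
learned_pick = grasp.choose(objects)
```

In this sketch `nearest["name"]` is `"cup"` and `learned_pick["name"]` is `"tube"`; which strategy fills the slot is a configuration choice, not an architectural one.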
This architecture also enables researchers to use the same intelligent agent on different robotic hardware by swapping out the tasks and the perceptual modules as needed by each robot’s physical architecture and sensor requirements.