Google DeepMind is pushing the boundaries of artificial intelligence by teaching robots to learn from video, a method reminiscent of how humans learn. Powered by the Gemini 1.5 Pro generative AI model, these robots can absorb information from a recorded walkthrough and use it to navigate their surroundings and complete tasks.
Google DeepMind’s latest innovation involves training robots using video tours of environments such as homes and offices, much as a human intern might be shown around a new workspace. The key enabler is the Gemini 1.5 Pro model’s long context window, which lets the AI take in an entire tour in a single pass rather than in fragments. As a result, the robot can watch one video and come away with a comprehensive understanding of its environment.
The training process starts with filming a detailed video tour of a space. The robot then ‘watches’ this video to learn the layout and specific details of the environment. That knowledge lets it complete tasks given through both verbal and image inputs, interacting with its surroundings in a way that mirrors human behavior.
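DeepMind has not published the robots’ control stack, but the public Gemini API gives a feel for the core idea: a long-context multimodal model can accept an entire video tour alongside a natural-language request in a single prompt. Below is a minimal sketch using Google’s generative AI Python SDK; the file name and the question are hypothetical, and a real robot would feed the answer into its own navigation system.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes a Gemini API key

# Upload the recorded walkthrough; "office_tour.mp4" is a hypothetical file.
tour = genai.upload_file(path="office_tour.mp4")
while tour.state.name == "PROCESSING":  # wait until the video is ingested
    time.sleep(5)
    tour = genai.get_file(tour.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# The long context window lets the whole tour and the request share one prompt.
response = model.generate_content([
    tour,
    "Using the tour above, explain how to get from the entrance to the copier.",
])
print(response.text)
```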
The effectiveness of this training method is evident in practical tests. Gemini-powered robots operated within a 9,000-square-foot area and successfully followed over 50 different user instructions with a 90 percent success rate. This high level of accuracy highlights the potential for real-world applications of AI-powered robots, from assisting with household chores to performing complex tasks in a work environment.
One notable capability of the Gemini 1.5 Pro model is its ability to perform multi-step tasks. For instance, the robot can answer a question like whether a specific drink is in stock by navigating to a refrigerator, visually processing its contents, and returning with an answer. This sequence of actions reflects a deeper level of understanding and execution than the single-step commands most robots handle today.
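The article does not detail how DeepMind chains these steps together, but one way to picture a multi-step query is as a simple perceive-decide-act loop: the model picks a location from its memory of the tour, the robot drives there, and the model grounds its final answer in a fresh camera image. The sketch below assumes hypothetical robot primitives (`navigate_to`, `capture_image`) that are not part of any Gemini API.

```python
import google.generativeai as genai
from PIL import Image

model = genai.GenerativeModel("gemini-1.5-pro")

def answer_query(query: str, robot) -> str:
    """Hypothetical multi-step loop: plan a destination, go, look, answer."""
    # Step 1: ask the model which location from the tour to check.
    destination = model.generate_content(
        f"A user asks: '{query}'. Reply with the single location in the "
        "building to check, using names from the tour video."
    ).text.strip()

    # Step 2: drive there and take a photo (stand-ins for a real robot's API).
    robot.navigate_to(destination)
    photo: Image.Image = robot.capture_image()

    # Step 3: ground the final answer in what the camera actually sees.
    return model.generate_content(
        [photo, f"Based on this image, answer the question: {query}"]
    ).text
```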
Despite these advancements, challenges remain. The robot currently takes up to 30 seconds to process each instruction, far slower than a person completing the same task by hand. Real-world environments are also messier and less predictable than controlled test settings.
However, the integration of AI models like Gemini 1.5 Pro into robotics marks a significant leap forward. The potential applications of such technology extend beyond simple tasks. Robots equipped with advanced AI could revolutionize industries such as healthcare, shipping, and janitorial services.
While we may not see these robots on the market soon, the research conducted by Google DeepMind paves the way for future innovations. The ability of robots to learn and interact with their environment through videos could lead to more intuitive and capable AI systems.