>>97238335
https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf
>The first is Gemini Robotics, an advanced vision-language-action (VLA) model that was built on Gemini 2.0 with the addition of physical actions as a new output modality for the purpose of directly controlling robots.
>The second is Gemini Robotics-ER, a Gemini model with advanced spatial understanding, enabling roboticists to run their own programs using Gemini’s embodied reasoning (ER) abilities.
The ER model is capable of plotting grasp points and angles, and can map points between two views of the same scene. They distilled the ER model into a smaller model that can be inferenced fast enough to be useful.
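IIRC the pointing outputs come back as 2D points in (y, x) order normalized to a 0-1000 range (same convention Gemini 2.0 uses for pointing), so if you want to play with the grasp-point stuff yourself you basically just need a small conversion helper. Quick sketch, the JSON schema here is my own guess for illustration, not their actual API:

import json

def parse_points(model_response: str, img_w: int, img_h: int) -> dict:
    # Assumed response format (made up for this example):
    # [{"point": [430, 770], "label": "banana"}], with (y, x) normalized to 0-1000.
    points = {}
    for entry in json.loads(model_response):
        y_norm, x_norm = entry["point"]
        points[entry["label"]] = (round(x_norm / 1000 * img_w), round(y_norm / 1000 * img_h))
    return points

print(parse_points('[{"point": [430, 770], "label": "banana"}]', 640, 480))
# {'banana': (493, 206)}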
They do this weird 'split inference' trick where they do most of the inference in the cloud and then do the decoding locally. Apparently this reduces latency? picrel suggests the decoding considers the most recent robot data. They say there's a query-to-response latency of 250ms, yet they also say that by chunking actions together they get an 'effective' control frequency of 50 Hz. Could this solve the latency bottleneck that Vedal talked about regarding LLMs playing time-sensitive games, by making pipelining work?
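Toy sketch of how I read the chunking + pipelining thing (numbers eyeballed from the paper, this is not their actual stack): you keep stepping through the current action chunk at 50 Hz while the next cloud query is already in flight, so the ~250ms round trip stays hidden as long as one chunk covers more wall-clock time than one query takes.

import time
from concurrent.futures import ThreadPoolExecutor

CONTROL_HZ = 50          # effective control rate from stepping through chunked actions
QUERY_LATENCY_S = 0.25   # ballpark query-to-response latency from the paper

def query_model(observation):
    # Stand-in for the cloud backbone + local decoder: returns a chunk of actions.
    time.sleep(QUERY_LATENCY_S)                 # pretend network + inference time
    return [f"action_{i}" for i in range(25)]   # 25 actions = 0.5 s of motion at 50 Hz

def control_loop(steps=200):
    with ThreadPoolExecutor(max_workers=1) as pool:
        chunk = query_model(observation=None)                  # first chunk, blocking
        pending = pool.submit(query_model, observation=None)   # next query already in flight
        i = 0
        for _ in range(steps):
            action = chunk[min(i, len(chunk) - 1)]  # repeat last action if a chunk runs dry
            i += 1
            time.sleep(1 / CONTROL_HZ)              # "send" action to the robot at 50 Hz
            if pending.done():                      # new chunk landed: swap it in, query again
                chunk, i = pending.result(), 0
                pending = pool.submit(query_model, observation=None)

control_loop()

With 0.5s chunks and 0.25s latency the next chunk always arrives before the current one runs out, which I assume is the whole point, and also why it might work for time-sensitive games: the model only has to be fast enough per chunk, not per frame.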
>robot modality
Contrary to what I expected, it just generates single-use Python snippets using an API provided by the robot. They don't mention the modality in the paper.
>Gemini Robotics-ER and Gemini Robotics were trained on datasets comprised of images, text, and robot sensor and action data
I wonder if there are any open source datasets for these specific challenges (e.g. finding grasp points, 3D bounding box estimation). They seem rather special-purpose (as opposed to regular LLMs, which engorge themselves on pure streams of text).
>Example robot control output:
banana_name = "banana"
grasp_position, grasp_orientation = robot.get_grasp_position_and_euler_orientation(gripper=Gripper.LEFT,
    object_name=banana_name)
robot.move_gripper_to(position=grasp_position, orientation=grasp_orientation, gripper=Gripper.LEFT)
robot.close_gripper(gripper=Gripper.LEFT)